A1-Exploring Word Vectors

Assignment Overview

The first assignment is quite simple: it just gives a first taste of word vectors and is meant to spark our curiosity. There is not much to write about, so treat this as a record of the learning process.

Part 1: Count-Based Word Vectors

The starting point is a simple idea: "You shall know a word by the company it keeps." We build word vectors from a co-occurrence matrix: pick a window size n and, for each word, count how many times every other word appears inside its context window.

Once the co-occurrence matrix is built, its rows (or columns) can be viewed as word vectors, but these vectors are usually very large (their dimensionality equals the number of distinct words in the corpus). The next step is therefore dimensionality reduction.
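A tiny made-up example: for the single document "<START> all that glitters <END>" with window size 1, the word "that" co-occurs once with "all" and once with "glitters", so its row in the co-occurrence matrix has a 1 in the "all" and "glitters" columns and 0 everywhere else.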

We apply singular value decomposition (SVD), a generalization of PCA (principal component analysis), to keep only the principal components:
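In equation form (a brief stand-in for the original figure, using standard SVD notation): the co-occurrence matrix $M \in \mathbb{R}^{|V| \times |V|}$ factors as $M = U \Sigma V^\top$, and keeping only the $k$ largest singular values gives the reduced embedding matrix $\hat{M} = U_k \Sigma_k \in \mathbb{R}^{|V| \times k}$, one $k$-dimensional row vector per word.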

To summarize, the count-based word vector pipeline is: build the vocabulary, compute the co-occurrence matrix, and reduce its dimensionality with SVD; these steps correspond to the tasks below.

When the corpus is very large, running a full SVD on the co-occurrence matrix is extremely expensive in both time and memory. Since the co-occurrence matrix is quite sparse, truncated SVD is usually chosen instead.
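As an aside (not part of the assignment), one way to exploit that sparsity is to store the counts as a scipy.sparse matrix and feed it straight to Scikit-Learn's TruncatedSVD, which accepts sparse input, so the dense |V| × |V| matrix never has to be materialized. A minimal sketch with made-up toy counts:

import numpy as np
from scipy.sparse import csr_matrix
from sklearn.decomposition import TruncatedSVD

# Toy stand-in for a large, mostly-zero co-occurrence matrix.
rng = np.random.default_rng(0)
counts = rng.poisson(0.01, size=(2000, 2000)).astype(np.float64)
sparse_counts = csr_matrix(counts)              # keeps only the non-zero entries

svd = TruncatedSVD(n_components=2, n_iter=10)
embeddings = svd.fit_transform(sparse_counts)   # sparse input works directly
print(embeddings.shape)                         # (2000, 2)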

Task 1: Build the vocabulary

The task is to extract the list of distinct words from the corpus; Python's set makes this trivial.

def distinct_words(corpus):
    """ Determine a list of distinct words for the corpus.
        Params:
            corpus (list of list of strings): corpus of documents
        Return:
            corpus_words (list of strings): sorted list of distinct words across the corpus
            n_corpus_words (integer): number of distinct words across the corpus
    """
    corpus_words = []
    n_corpus_words = -1
  
    # ------------------
    # Write your implementation here.
    # Flatten the corpus, deduplicate with a set, and sort for a stable ordering.
    corpus_words = sorted(set(w for doc in corpus for w in doc))
    n_corpus_words = len(corpus_words)

    # ------------------

    return corpus_words, n_corpus_words
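A quick sanity check on a toy corpus (two already-tokenized "documents" of my own, not from the handout):

test_corpus = [["<START>", "all", "that", "glitters", "<END>"],
               ["<START>", "all", "is", "not", "gold", "<END>"]]
words, n_words = distinct_words(test_corpus)
print(words)    # ['<END>', '<START>', 'all', 'glitters', 'gold', 'is', 'not', 'that']
print(n_words)  # 8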

Task 2: Compute the co-occurrence matrix

The task is to compute the co-occurrence matrix. A straightforward loop over each document works; just be careful with the indices at the window boundaries.

def compute_co_occurrence_matrix(corpus, window_size=4):
    """ Compute co-occurrence matrix for the given corpus and window_size (default of 4).
  
        Note: Each word in a document should be at the center of a window. Words near edges will have a smaller
              number of co-occurring words.
  
              For example, if we take the document "<START> All that glitters is not gold <END>" with window size of 4,
              "All" will co-occur with "<START>", "that", "glitters", "is", and "not".
  
        Params:
            corpus (list of list of strings): corpus of documents
            window_size (int): size of context window
        Return:
            M (a symmetric numpy matrix of shape (number of unique words in the corpus , number of unique words in the corpus)): 
                Co-occurrence matrix of word counts. 
                The ordering of the words in the rows/columns should be the same as the ordering of the words given by the distinct_words function.
            word2ind (dict): dictionary that maps word to index (i.e. row/column number) for matrix M.
    """
    words, n_words = distinct_words(corpus)
    M = None
    word2ind = {}  
  
    # ------------------
    # Write your implementation here.
    word2ind = {w: i for i, w in enumerate(words)}
    M = np.zeros((n_words, n_words))
    for doc in corpus:
        # Convert the document to word indices once.
        indices = [word2ind[w] for w in doc]
        for i, wid in enumerate(indices):
            # Clip the context window at the document boundaries.
            left = max(i - window_size, 0)
            right = min(i + window_size, len(doc))
            # Count every word in the window except the center word itself.
            for j in indices[left:i] + indices[i + 1:right + 1]:
                M[wid, j] += 1
    # ------------------
    return M, word2ind
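Reusing the same kind of toy corpus, a window of size 1 means each word only co-occurs with its immediate neighbours (again a small check of my own):

test_corpus = [["<START>", "all", "that", "glitters", "<END>"],
               ["<START>", "all", "is", "not", "gold", "<END>"]]
M_test, word2ind_test = compute_co_occurrence_matrix(test_corpus, window_size=1)
print(M_test.shape)                                            # (8, 8)
print(M_test[word2ind_test["all"], word2ind_test["that"]])     # 1.0: "all that" occurs once
print(M_test[word2ind_test["<START>"], word2ind_test["all"]])  # 2.0: "<START> all" occurs twice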

Task 3: Dimensionality reduction

Use truncated SVD to obtain the matrix of k-dimensional word vectors.

Here we simply call Scikit-Learn's TruncatedSVD (imported from sklearn.decomposition). Note that fit_transform already returns the reduced matrix of shape (n_words, k), i.e. U·Σ truncated to the top k singular values, so no extra transpose is needed.

def reduce_to_k_dim(M, k=2):
    """ Reduce a co-occurence count matrix of dimensionality (num_corpus_words, num_corpus_words)
        to a matrix of dimensionality (num_corpus_words, k) using the following SVD function from Scikit-Learn:
            - http://scikit-learn.org/stable/modules/generated/sklearn.decomposition.TruncatedSVD.html
  
        Params:
            M (numpy matrix of shape (number of unique words in the corpus , number of unique words in the corpus)): co-occurrence matrix of word counts
            k (int): embedding size of each word after dimension reduction
        Return:
            M_reduced (numpy matrix of shape (number of corpus words, k)): matrix of k-dimensional word embeddings.
                    In terms of the SVD from math class, this actually returns U * S
    """  
    n_iters = 10     # Use this parameter in your call to `TruncatedSVD`
    M_reduced = None
    print("Running Truncated SVD over %i words..." % (M.shape[0]))
  
    # ------------------
    # Write your implementation here.
    # Randomized truncated SVD; fit_transform returns the reduced matrix U_k * S_k.
    svd = TruncatedSVD(n_components=k, n_iter=n_iters)
    M_reduced = svd.fit_transform(M)

    # ------------------
    print("Done.")
    return M_reduced
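A quick shape check on a random symmetric matrix standing in for a real co-occurrence matrix (a toy example of my own):

import numpy as np

M_toy = np.random.rand(50, 50)
M_toy = M_toy + M_toy.T                  # symmetric, like a real co-occurrence matrix
M_toy_reduced = reduce_to_k_dim(M_toy, k=2)
print(M_toy_reduced.shape)               # (50, 2): one 2-dimensional vector per "word"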

Task 4: Plot the word vectors

If we take k = 2, we get two-dimensional word vectors, which can be plotted directly in a 2D coordinate system.

def plot_embeddings(M_reduced, word2ind, words):
    """ Plot in a scatterplot the embeddings of the words specified in the list "words".
        NOTE: do not plot all the words listed in M_reduced / word2ind.
        Include a label next to each point.
  
        Params:
            M_reduced (numpy matrix of shape (number of unique words in the corpus , 2)): matrix of 2-dimensional word embeddings
            word2ind (dict): dictionary that maps word to indices for matrix M
            words (list of strings): words whose embeddings we want to visualize
    """
    # ------------------
    # Write your implementation here.
    for w in words:
        # Look up this word's row in the reduced embedding matrix.
        index = word2ind[w]
        x = M_reduced[index][0]
        y = M_reduced[index][1]
        # Draw the point and label it with the word itself.
        plt.scatter(x, y)
        plt.text(x, y, w)
    plt.show()

    # ------------------
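Putting the four pieces together on a toy corpus, here is a hedged end-to-end sketch; the row normalization is an extra step added here so the plot compares directions rather than raw counts:

import numpy as np
import matplotlib.pyplot as plt

toy_corpus = [["<START>", "all", "that", "glitters", "is", "not", "gold", "<END>"],
              ["<START>", "gold", "is", "heavy", "<END>"]]

M, word2ind = compute_co_occurrence_matrix(toy_corpus, window_size=2)
M_reduced = reduce_to_k_dim(M, k=2)

# Normalize each row to unit length before plotting.
row_norms = np.linalg.norm(M_reduced, axis=1, keepdims=True)
M_normalized = M_reduced / row_norms

plot_embeddings(M_normalized, word2ind, ["gold", "glitters", "heavy"])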

Part 2: Prediction-Based Word Vectors

The part of the assignment that requires writing our own code ends here; Part 2 simply explores what these word vectors can actually be used for.

As mentioned at the start, the basic idea behind these context-count word vectors implies that two similar words should appear in similar contexts, and therefore their word vectors should also be similar.

How do we measure the similarity of two word vectors? We can use the angle between them:

Taking its cosine:

$s = \dfrac{p \cdot q}{\lVert p \rVert \, \lVert q \rVert}$, where $s \in [-1, 1]$

The larger this value, the smaller the angle between the two vectors, and hence the more similar they are.
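A minimal NumPy version of this formula (my own helper, just for illustration):

import numpy as np

def cosine_similarity(p, q):
    """Cosine of the angle between p and q; 1 means same direction, -1 opposite."""
    return np.dot(p, q) / (np.linalg.norm(p) * np.linalg.norm(q))

print(cosine_similarity(np.array([1.0, 2.0]), np.array([2.0, 4.0])))   # 1.0 (parallel)
print(cosine_similarity(np.array([1.0, 0.0]), np.array([0.0, 1.0])))   # 0.0 (orthogonal)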

This part of the assignment uses this similarity to find synonyms, antonyms, and so on; I won't reproduce it here.

Note that this similarity does not really identify synonyms; it mostly reflects the fact that words with the same part of speech get similar vectors. For example, "large" and "small" are antonyms, yet intuitively they appear in very similar contexts, so their word vectors end up very close, and this is indeed what we observe.