A1-Exploring Word Vectors
Overview

The first assignment is quite simple: it just builds an initial picture of word vectors and whets our appetite for exploring them. There is honestly not much to write up, so treat this as a record of the learning process.
Part 1: Count-Based Word Vectors
It starts from a simple idea: "You shall know a word by the company it keeps." We derive word vectors from a co-occurrence matrix: choose a window size, and for every occurrence of a word, count each word that appears within that many positions to its left or right.
Once we have the co-occurrence matrix, its rows (or columns) already provide a kind of word vector, but these vectors are usually very large (their dimensionality is the number of words in the corpus). So our next step should be dimensionality reduction.

We use singular value decomposition (SVD), a generalization of PCA (principal component analysis), to keep only the principal features.
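Concretely (notation mine; the assignment illustrates this with a figure, omitted here), the full SVD factors the co-occurrence matrix as

$$M = U \Sigma V^\top,$$

and truncating to the $k$ largest singular values gives the rank-$k$ approximation

$$M \approx U_k \Sigma_k V_k^\top,$$

so each word's $k$-dimensional embedding is the corresponding row of $U_k \Sigma_k$, which is exactly the `U * S` that Task 3 below returns.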
To summarize, the count-based word-vector pipeline is:
- Build the co-occurrence matrix from the corpus
- Reduce the co-occurrence matrix's dimensionality with SVD
When the corpus is very large, running a full SVD on the co-occurrence matrix is extremely expensive in both time and memory. Since the matrix is quite sparse, one usually opts for truncated SVD instead, as in the sketch below.
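A minimal sketch of that large-corpus case (my own, not part of the assignment): Scikit-Learn's TruncatedSVD accepts SciPy sparse matrices directly, so the dense |V| x |V| matrix never has to be materialized. The random matrix here is just a stand-in for a real sparse co-occurrence matrix.

from scipy import sparse
from sklearn.decomposition import TruncatedSVD

# Stand-in for a sparse co-occurrence matrix over a 10,000-word vocabulary,
# stored in compressed sparse row format.
M_sparse = sparse.random(10_000, 10_000, density=0.001, format="csr", random_state=0)
svd = TruncatedSVD(n_components=2, n_iter=10, random_state=0)
M_reduced = svd.fit_transform(M_sparse)  # shape: (10000, 2)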
Task 1: Get the Word List

The task is to extract the list of distinct words from the corpus; a Python set makes this trivial.
def distinct_words(corpus):
    """ Determine a list of distinct words for the corpus.
        Params:
            corpus (list of list of strings): corpus of documents
        Return:
            corpus_words (list of strings): sorted list of distinct words across the corpus
            n_corpus_words (integer): number of distinct words across the corpus
    """
    corpus_words = []
    n_corpus_words = -1

    # ------------------
    # Write your implementation here.
    # Flatten the corpus, deduplicate with a set, then sort for a stable ordering.
    corpus_words = sorted(set(w for doc in corpus for w in doc))
    n_corpus_words = len(corpus_words)
    # ------------------

    return corpus_words, n_corpus_words
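A quick check on a toy corpus (my own example, not from the assignment):

corpus = [["<START>", "the", "cat", "sat", "<END>"],
          ["<START>", "the", "dog", "sat", "<END>"]]
words, n = distinct_words(corpus)
print(words)  # ['<END>', '<START>', 'cat', 'dog', 'sat', 'the']
print(n)      # 6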
Task 2: Compute the Co-Occurrence Matrix

The task is to compute the co-occurrence matrix; a plain traversal is enough, but watch the window indices carefully.
import numpy as np

def compute_co_occurrence_matrix(corpus, window_size=4):
    """ Compute co-occurrence matrix for the given corpus and window_size (default of 4).
        Note: Each word in a document should be at the center of a window. Words near edges will have a smaller
        number of co-occurring words.
        For example, if we take the document "<START> All that glitters is not gold <END>" with window size of 4,
        "All" will co-occur with "<START>", "that", "glitters", "is", and "not".
        Params:
            corpus (list of list of strings): corpus of documents
            window_size (int): size of context window
        Return:
            M (a symmetric numpy matrix of shape (number of unique words in the corpus, number of unique words in the corpus)):
                Co-occurrence matrix of word counts.
                The ordering of the words in the rows/columns should be the same as the ordering of the words given by the distinct_words function.
            word2ind (dict): dictionary that maps word to index (i.e. row/column number) for matrix M.
    """
    words, n_words = distinct_words(corpus)
    M = None
    word2ind = {}

    # ------------------
    # Write your implementation here.
    word2ind = {w: i for i, w in enumerate(words)}
    M = np.zeros((n_words, n_words))
    for s in corpus:
        # Map the document to a list of vocabulary indices.
        index = [word2ind[w] for w in s]
        for i, wid in enumerate(index):
            # Clamp the context window to the document boundaries;
            # slicing past the end is harmless in Python, so min(..., len(s)) suffices.
            left = max(i - window_size, 0)
            right = min(i + window_size, len(s))
            # Count every word in the window except the center word itself.
            for j in index[left:i] + index[i + 1:right + 1]:
                M[wid][j] += 1
    # ------------------

    return M, word2ind
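A sanity check against the example in the docstring (my own snippet):

corpus = [["<START>", "All", "that", "glitters", "is", "not", "gold", "<END>"]]
M, word2ind = compute_co_occurrence_matrix(corpus, window_size=4)
# "All" is at position 1, so its window covers "<START>", "that",
# "glitters", "is", and "not" -- one co-occurrence count each.
print(M[word2ind["All"]][word2ind["glitters"]])  # 1.0
print(M[word2ind["All"]][word2ind["gold"]])      # 0.0 ("gold" is 5 positions away)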
Task 3: Dimensionality Reduction

Use truncated SVD to obtain the k-dimensional word-vector matrix.

Here we simply call Scikit-Learn's TruncatedSVD; note the shape of the result: fit_transform already returns a (num_words, k) matrix, so no extra transpose is needed.
from sklearn.decomposition import TruncatedSVD

def reduce_to_k_dim(M, k=2):
    """ Reduce a co-occurrence count matrix of dimensionality (num_corpus_words, num_corpus_words)
        to a matrix of dimensionality (num_corpus_words, k) using the following SVD function from Scikit-Learn:
            - http://scikit-learn.org/stable/modules/generated/sklearn.decomposition.TruncatedSVD.html
        Params:
            M (numpy matrix of shape (number of unique words in the corpus, number of unique words in the corpus)): co-occurrence matrix of word counts
            k (int): embedding size of each word after dimension reduction
        Return:
            M_reduced (numpy matrix of shape (number of corpus words, k)): matrix of k-dimensional word embeddings.
                In terms of the SVD from math class, this actually returns U * S
    """
    n_iters = 10  # Use this parameter in your call to `TruncatedSVD`
    M_reduced = None
    print("Running Truncated SVD over %i words..." % (M.shape[0]))

    # ------------------
    # Write your implementation here.
    svd = TruncatedSVD(n_components=k, n_iter=n_iters)
    # fit_transform projects the rows of M onto the top-k singular directions,
    # i.e. it returns U * S with shape (num_corpus_words, k).
    M_reduced = svd.fit_transform(M)
    # ------------------

    print("Done.")
    return M_reduced
Task 4: Plot the Word Vectors

If we take k = 2, we get two-dimensional word vectors that can be drawn directly on a 2-D plot.
import matplotlib.pyplot as plt

def plot_embeddings(M_reduced, word2ind, words):
    """ Plot in a scatterplot the embeddings of the words specified in the list "words".
        NOTE: do not plot all the words listed in M_reduced / word2ind.
        Include a label next to each point.
        Params:
            M_reduced (numpy matrix of shape (number of unique words in the corpus, 2)): matrix of 2-dimensional word embeddings
            word2ind (dict): dictionary that maps word to indices for matrix M
            words (list of strings): words whose embeddings we want to visualize
    """
    # ------------------
    # Write your implementation here.
    for w in words:
        index = word2ind[w]
        x = M_reduced[index][0]
        y = M_reduced[index][1]
        # One scatter point per word, labeled with the word itself.
        plt.scatter(x, y)
        plt.text(x, y, w)
    plt.show()
    # ------------------
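Putting the whole pipeline together on a toy corpus (my own example, not the assignment's data):

corpus = [["<START>", "all", "that", "glitters", "is", "not", "gold", "<END>"],
          ["<START>", "all", "is", "well", "that", "ends", "well", "<END>"]]
M, word2ind = compute_co_occurrence_matrix(corpus, window_size=4)
M_reduced = reduce_to_k_dim(M, k=2)
plot_embeddings(M_reduced, word2ind, ["gold", "glitters", "well", "ends"])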
Part 2: Prediction with Word Vectors

The parts of this assignment that require our own code are now over; Part 2 just looks at what count-based word vectors can actually do.

As mentioned at the start, the basic idea behind these context-count-based word vectors is that two similar words should have similar contexts, and hence similar word vectors.
How do we measure how similar two word vectors are? We can use the angle between them, or rather its cosine:

$$\operatorname{sim}(\mathbf{u}, \mathbf{v}) = \cos\theta = \frac{\mathbf{u} \cdot \mathbf{v}}{\lVert\mathbf{u}\rVert \, \lVert\mathbf{v}\rVert}$$

The larger this value, the smaller the angle between the two vectors, and the more similar the words.
This part of the assignment uses that similarity to look up synonyms, antonyms, and so on; I won't reproduce it all here.

One caveat: this similarity does not really find synonyms. What it mostly captures is that words used in the same way get similar vectors. Take "large" and "small": although they are antonyms, our intuition says they appear in nearly identical contexts, so their word vectors should be very close, and the experiment confirms exactly that (see the sketch below).
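For instance, a minimal nearest-neighbor helper over our own embeddings (my sketch; the assignment itself uses a ready-made similarity query, not this function):

import numpy as np

def most_similar(word, M_reduced, word2ind, topn=3):
    """Return the topn words closest to `word` by cosine similarity."""
    # Normalize every row; the epsilon guards against all-zero rows.
    norms = np.linalg.norm(M_reduced, axis=1, keepdims=True)
    vecs = M_reduced / np.maximum(norms, 1e-12)
    # Cosine similarity of every word against the query word.
    sims = vecs @ vecs[word2ind[word]]
    ind2word = {i: w for w, i in word2ind.items()}
    ranked = np.argsort(-sims)  # indices sorted by descending similarity
    return [(ind2word[int(i)], float(sims[i])) for i in ranked
            if int(i) != word2ind[word]][:topn]

Under this measure, antonym pairs like "large" / "small" do come out as near neighbors, exactly as described above.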