Introduction
This is a series of articles about language analysis of ECS 236th meeting abstracts.
In this series, I've been explaining the techniques used in my webapp ECS Meeting Explorer. An introduction to the app is available in the article below:
ECS Meeting Explorer – webapp for scientific conference
My previous article, about data scraping, is at the following link:
Analysis of ECS 236th meeting abstracts(1) – data scraping with BeautifulSoup4
In this article, I will explain word embedding: the vectorization of the words used in all of the abstract texts.
Preparation
In this series of articles, I use Python. Please install these libraries:
numpy > 1.14.5
pandas > 0.23.1
matplotlib > 2.2.2
beautifulsoup4 > 4.6.0
gensim > 3.4.0
scikit-learn > 0.19.1
scipy > 1.1.0
tqdm (any recent version; used for progress bars in the code below)
Before the analysis, please download all ECS 236th meeting abstracts from the official site. Unzip the archive and place it in the same directory as your Jupyter notebook.
Data scraping with BeautifulSoup4 was explained in my previous article, so please check it first!
Word embedding by Word2Vec
Word2Vec (W2V) is a machine learning model used to produce word embeddings, i.e. mappings of words into a vector space.
Word2Vec is a kind of unsupervised learning, so we don't have to label the training data. That is precious to me, because labeling is always hard work.
In this experiment, we use the Word2Vec implementation in Gensim, so we don't have to build the model ourselves. Further information about Word2Vec is available below:
models.word2vec – Word2vec embeddings(Gensim documentation)
Word2vec Tutorial | RARE Technologies
The original paper on word2vec:
Distributed Representations of Words and Phrases and their Compositionality
Now we have a list containing the details of every abstract: title, authors, affiliations, session name, and contents, as follows:
```python
> dic_all
[{'num': '0001',
  'title': 'The Impact of Coal Mineral Matter (alumina and silica) on Carbon Electrooxidation in the Direct Carbon Fuel Cell',
  'author': ['Simin Moradmanda', 'Jessica A Allena', 'Scott W Donnea'],
  'affiliation': 'University of Newcastle',
  'session': 'A01',
  'session_name': 'Battery and Energy Technology Joint General Session',
  'contents': 'Direct carbon fuel cell DCFC as an electrochemical device...',
  'mod_contents': ['direct', 'carbon', 'fuel', 'cell', ... , 'melting'],
  'vector': 0,
  'url': '1.html'},
 ... ]
```
Next, let's extract the lists of words that were preprocessed for language analysis.
```python
# make word list for W2V learning
docs = [i['mod_contents'] for i in dic_all]
```
This is the code for training the Word2Vec model. Only a few lines!
```python
# Word2Vec model learning
from gensim.models.word2vec import Word2Vec
import logging

logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s',
                    level=logging.INFO)

model = Word2Vec(docs, sg=1, size=200, window=5, min_count=30,
                 workers=4, sample=1e-6, negative=5, iter=1000)
print('corpus = ', model.corpus_count)
```
The line ‘model = Word2Vec(docs, …)’ performs the training. The parameter ‘size’ sets the dimension of the word vectors, in this case 200. Please see the Gensim documentation for the other parameters of this function.
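As a quick sanity check, the trained Gensim model also offers a built-in similarity query; ‘battery’ below is just an arbitrary example word of mine, not one used in the original article:

```python
# built-in cosine-similarity query on the trained model
# ('battery' is an arbitrary example word assumed to be in the vocabulary)
print(model.wv.most_similar('battery', topn=5))
```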
After training, we build a vocabulary and the word vectors from the Word2Vec model.
The vocabulary is saved as a .npy file in the same directory.
```python
import numpy as np

# make word dictionary (word -> index)
vocab = [i for i in model.wv.vocab]
dictionary = {}
for n, j in enumerate(vocab):
    dictionary[j] = n
np.save('dictionary.npy', np.array([dictionary]))

# make word vectors from the model
word_vectors = [model.wv[i] for i in model.wv.vocab]
word_vectors = np.array(word_vectors)
```
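To reuse the saved vocabulary later (for example, in the webapp), it can be loaded back as follows; this loading step is my addition, and note that recent NumPy versions require allow_pickle=True when loading object arrays:

```python
# reload the saved word -> index mapping
dictionary = np.load('dictionary.npy', allow_pickle=True)[0]
```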
Now we have the word list and the corresponding vectors.
In this vector space, the similarity between words is expressed as a distance. Cosine similarity is usually used for such high-dimensional vector spaces.
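The function below calls a cos_sim helper that is not defined in the article; a minimal implementation of cosine similarity with NumPy would look like this:

```python
import numpy as np

def cos_sim(a, b):
    # cosine similarity between two 1-D vectors
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
```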
The function to calculate word similarities is below:
```python
import pandas as pd

def CalcSim(target, vectors, dictionary):
    # similarity of every word vector to the target word's vector
    target_vec = vectors[dictionary[target]]
    search_results = []
    for n, vector in enumerate(vectors):
        sim = cos_sim(target_vec, vector)
        result = {'num': n, 'value': list(dictionary.keys())[n], 'similarity': sim}
        search_results.append(result)
    summary_pd = pd.io.json.json_normalize(search_results)
    summary_sorted = summary_pd.sort_values('similarity', ascending=False)
    return summary_sorted
```
Okay, let's search for the words most similar to ‘sustainable’, a recent buzzword.
```python
target = 'sustainable'
summary_sorted = CalcSim(target, word_vectors, dictionary)
summary_sorted[:10]
```
The top 10 results are shown below:
| num | similarity | value |
|-----|------------|-------|
| 588 | 1.000000 | sustainable |
| 105 | 0.648442 | renewable |
| 100 | 0.552662 | energy |
| 1625 | 0.547270 | fuels |
| 862 | 0.541807 | efficient |
| 1624 | 0.533530 | fossil |
| 13 | 0.521877 | electricity |
| 607 | 0.480525 | technologies |
| 138 | 0.472065 | production |
| 108 | 0.471985 | wind |
The word most similar to ‘sustainable‘ is ‘renewable‘.
It's a satisfying result, isn't it?
2-dimensional visualization of word vectors
As I mentioned, the word vectors have 200 dimensions.
It is impossible for human beings to picture such high-dimensional data, so dimensionality reduction is needed for visualization.
In this case, we use Principal Component Analysis (PCA) to go from 200 to 100 dimensions, then t-distributed Stochastic Neighbor Embedding (t-SNE) to go from 100 to 2. Both methods are implemented in scikit-learn.
The function for dimensionality reduction is as follows:
```python
from sklearn.decomposition import IncrementalPCA
from sklearn.manifold import TSNE
from tqdm import tqdm

def tsne_reduction(dataset):
    n = dataset.shape[0]
    batch_size = 500
    # PCA: reduce to 100 dimensions, fitted incrementally in batches
    ipca = IncrementalPCA(n_components=100)
    for i in tqdm(range(n // batch_size)):
        ipca.partial_fit(dataset[i*batch_size:(i+1)*batch_size])
    r_dataset = ipca.transform(dataset)
    # t-SNE: reduce from 100 to 2 dimensions
    r_tsne = TSNE(n_components=2, random_state=0, perplexity=50.0,
                  n_iter=3000).fit_transform(r_dataset)
    return r_tsne

w2v_tsne = tsne_reduction(word_vectors)
```
Now we can plot the 2-dimensional word vectors.
The left panel shows a scatter plot of all word vectors; the right panel highlights some points with their corresponding words.
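The plotting code itself is not shown in the article; a minimal sketch with matplotlib (the highlighted words are arbitrary examples of mine, assumed to be in the vocabulary) could look like this:

```python
import matplotlib.pyplot as plt

# scatter plot of all 2-D word vectors
fig, ax = plt.subplots(figsize=(8, 8))
ax.scatter(w2v_tsne[:, 0], w2v_tsne[:, 1], s=3, alpha=0.5)

# highlight a few words (arbitrary examples, assumed to be in the vocabulary)
for word in ['battery', 'electrode', 'sustainable']:
    i = dictionary[word]
    ax.annotate(word, (w2v_tsne[i, 0], w2v_tsne[i, 1]), color='red')
plt.show()
```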
Precomputation of word-topic vectors by SCDV
We could estimate document vectors for the abstracts by averaging these word vectors with certain weights (such as tf-idf); a minimal sketch of that baseline is shown below. In this case, however, I will apply a method named SCDV (Sparse Composite Document Vectors) to modify the word vectors first.
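For reference, here is that weighted-average baseline, shown with plain uniform weights for brevity; this snippet is my illustration, not code from the original article:

```python
import numpy as np

def average_doc_vector(doc, word_vectors, dictionary):
    # plain average of the word vectors of one document;
    # a tf-idf weighted version would multiply each vector by its weight
    vecs = [word_vectors[dictionary[w]] for w in doc if w in dictionary]
    return np.mean(vecs, axis=0)

doc_vec = average_doc_vector(dic_all[0]['mod_contents'], word_vectors, dictionary)
```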
There are two steps in SCDV for building a document vector.
- Precompute word-topic vectors.
- Build sparse document vectors using the word-topic vectors.
In this section, I will explain the first step.
The computation of word-topic vectors is divided into three steps:
- Word vectors are classified into several clusters with a soft clustering algorithm, which allows each word to belong to every cluster with a certain probability.
- Word-cluster vectors are made by multiplying each word vector by its probability of belonging to each cluster.
- All word-cluster vectors are concatenated, with idf (inverse document frequency) weighting, to form the word-topic vector.
This is a function to transform the word vectors into word-topic vectors.
```python
from sklearn.mixture import GaussianMixture

def WordTopicVectors(word_vectors):
    # Gaussian mixture modelling (soft clustering of the word vectors)
    num_clusters = 30
    clf = GaussianMixture(n_components=num_clusters, covariance_type="full")
    z_gmm = clf.fit(word_vectors)
    idx = clf.predict(word_vectors)
    idx_proba = clf.predict_proba(word_vectors)

    # Calculate word idf
    words = np.array(list(dictionary.keys()))
    word_idf = np.zeros_like(words, dtype=np.uint32)
    for doc in tqdm(docs):
        lim = len(doc)
        for w in doc:
            if lim == 0:
                break
            else:
                idx = np.where(w == words)
                word_idf[idx] += 1
                lim -= 1
    word_counts = word_idf
    word_idf = np.log(len(docs) / word_idf) + 1

    # Concatenate the word vector copies weighted by the GMM cluster probabilities
    gmm_word_vectors = np.empty((word_vectors.shape[0],
                                 word_vectors.shape[1] * num_clusters))
    n = 0
    for vector, proba, idf in zip(word_vectors, idx_proba, word_idf):
        for m, p in enumerate(proba):
            if m == 0:
                cluster_vector = vector * p
            else:
                cluster_vector = np.hstack((cluster_vector, vector * p))
        gmm_word_vectors[n] = idf * cluster_vector
        n += 1
    return gmm_word_vectors

# Calculate word-topic vectors
gmm_word_vectors = WordTopicVectors(word_vectors)
```
In this function, we use a Gaussian mixture model for clustering. The original paper recommends 60 or more clusters, but I chose 30 here (because of memory limitations in the webapp).
The dimension of the word-topic vectors is 200 (original word vector) × 30 (number of clusters) = 6000.
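A quick shape check (my addition, not from the original article) confirms the dimensionality:

```python
# each word-topic vector should have 200 * 30 = 6000 dimensions
print(gmm_word_vectors.shape)  # -> (vocabulary size, 6000)
```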
Then, let's visualize it with t-SNE dimensionality reduction!
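The article does not show this step explicitly, but the tsne_reduction function defined earlier can presumably be reused as-is:

```python
# reduce the 6000-dimensional word-topic vectors to 2-D for plotting
gmm_tsne = tsne_reduction(gmm_word_vectors)
```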
Compared with the raw Word2Vec word vectors, the clusters are now clearly separated.
This means these vectors represent the relationships between words and topics well.
Let's look at the details of each cluster and its corresponding words.
The figure clearly shows that words of the same topic belong to the same cluster.
Conclusion
In this article, I demonstrated word embedding with Word2Vec and its modification with SCDV.
In the next article, I will explain how to build document vectors from these word-topic vectors!