Edit 2: I thought more about my question and realized it was way too general; it really comes down to something basic:
creating a new array from the GloVe file (glove.6B.300d.txt) that contains ONLY the words that appear in my document.
I'm aware that this actually has nothing to do with this specific GloVe file, and that I should learn how to do it for any two lists of words...
I assume I just don't know how to search for this properly, i.e. which libraries/functions/buzzwords to look for.
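To make it concrete, here is a rough sketch of the generic version I have in mind; vocabulary and doc_words are placeholder names for the two word lists:

# Keep only the entries of one word list that also appear in another.
vocabulary = ['the', 'cat', 'sat', 'on', 'mat']   # e.g. words in the GloVe file
doc_words = ['cat', 'mat', 'unicorn']             # e.g. words in my document

doc_set = set(doc_words)                          # set membership tests are O(1)
filtered = [w for w in vocabulary if w in doc_set]
print(filtered)  # ['cat', 'mat'] -- 'unicorn' is not in the vocabulary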
Edit 1: I'm adding the code I used; it works on the full GloVe file:
from __future__ import division
import codecs
from numbers import Number
import numpy
from sklearn.cluster import KMeans

class autovivify_list(dict):
    """A dict whose missing keys default to an empty list."""
    def __missing__(self, key):
        value = self[key] = []
        return value
    def __add__(self, x):
        # Allow sum() over an empty instance.
        if not self and isinstance(x, Number):
            return x
        raise ValueError
    def __sub__(self, x):
        if not self and isinstance(x, Number):
            return -1 * x
        raise ValueError

def build_word_vector_matrix(vector_file, n_words):
    """Read the first n_words rows of a GloVe file into a matrix plus a word list."""
    numpy_arrays = []
    labels_array = []
    with codecs.open(vector_file, 'r', 'utf-8') as f:
        for c, r in enumerate(f):
            sr = r.split()
            labels_array.append(sr[0])
            numpy_arrays.append(numpy.array([float(i) for i in sr[1:]]))
            if c == n_words - 1:  # stop after exactly n_words rows
                break
    return numpy.array(numpy_arrays), labels_array

def find_word_clusters(labels_array, cluster_labels):
    """Map each cluster id to the list of words assigned to it."""
    cluster_to_words = autovivify_list()
    for c, i in enumerate(cluster_labels):
        cluster_to_words[i].append(labels_array[c])
    return cluster_to_words

if __name__ == "__main__":
    input_vector_file = '/Users/.../Documents/GloVe/glove.6B/glove.6B.300d.txt'
    n_words = 1000
    reduction_factor = 0.5
    n_clusters = int(n_words * reduction_factor)

    df, labels_array = build_word_vector_matrix(input_vector_file, n_words)
    kmeans_model = KMeans(init='k-means++', n_clusters=n_clusters, n_init=10)
    kmeans_model.fit(df)

    cluster_labels = kmeans_model.labels_
    cluster_inertia = kmeans_model.inertia_
    cluster_to_words = find_word_clusters(labels_array, cluster_labels)

    for c in cluster_to_words:
        print(cluster_to_words[c])
        print("\n")
Original question:
Let's say I have a specific text (say of 500 words). I want to do the following:
- Create an embedding of all the words in this text (i.e. get the GloVe vectors for only these 500 words)
- Cluster them (*this part I know how to do)
How do I do such a thing?
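To be explicit, I imagine the end result looking something like this sketch; text_words is a placeholder for my tokenized document, and load_vectors_for is a name I made up (I don't know if a standard helper for this already exists):

import codecs
import numpy
from sklearn.cluster import KMeans

def load_vectors_for(vector_file, wanted_words):
    """Scan the GloVe file once, keeping only vectors for the wanted words."""
    wanted = set(wanted_words)
    words, vectors = [], []
    with codecs.open(vector_file, 'r', 'utf-8') as f:
        for line in f:
            parts = line.split()
            if parts[0] in wanted:
                words.append(parts[0])
                vectors.append(numpy.array([float(x) for x in parts[1:]]))
    return words, numpy.array(vectors)

text_words = ['cat', 'sat', 'mat']  # stand-in for the ~500 words of my text
words, matrix = load_vectors_for('glove.6B.300d.txt', text_words)
kmeans = KMeans(init='k-means++', n_clusters=len(words) // 2, n_init=10)
kmeans.fit(matrix)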