
What is the best way to find out which words are frequent near some word X? (note: NOT which words are most similar to word X)

I have GloVe word vectors, so each vector represents a distribution of some word across different environments (each dimension is an environment). So how do I retrieve words from each of those environments? In other words, how do I retrieve words that are similar in only one of the dimensions?

I tried looking for words that are closer to X along only one dimension, ignoring the rest, but that gave me garbage words.

P.S. What I have done so far is find the N nearest words (by cosine similarity) to word X, and then apply K-means clustering to those words. It works pretty well, but I am concerned that the N nearest words are not necessarily the words that appear NEAR word X, but rather words that appear IN SIMILAR ENVIRONMENTS to word X.
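For reference, here is a minimal sketch of that pipeline, assuming gensim and scikit-learn are available and the GloVe file has already been converted to word2vec text format (the file name, N, and K below are placeholders):

```python
# Sketch of the approach described above: take the N nearest neighbours of X
# by cosine similarity, then cluster them with K-means.
import numpy as np
from gensim.models import KeyedVectors
from sklearn.cluster import KMeans

# GloVe vectors converted to word2vec text format beforehand
# (e.g. with gensim's glove2word2vec helper). File name is a placeholder.
kv = KeyedVectors.load_word2vec_format("glove.840B.300d.w2v.txt")

def neighbour_clusters(word, n=200, k=10):
    # N nearest words by cosine similarity
    neighbours = [w for w, _ in kv.most_similar(word, topn=n)]
    vectors = np.array([kv[w] for w in neighbours])
    # group the neighbours into k clusters
    labels = KMeans(n_clusters=k, n_init=10).fit_predict(vectors)
    clusters = {}
    for w, label in zip(neighbours, labels):
        clusters.setdefault(label, []).append(w)
    return clusters

for label, words in neighbour_clusters("bank").items():
    print(label, words[:8])
```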

EDIT: Clarification: simply collecting n-gram counts will not suffice, since I am looking for a way to do this with only the vectors, that is, without access to the corpus itself. The reason is that some freely available pretrained vectors were trained on terabytes of data. Storing the entire n-gram counts for Common Crawl, for example, would be very wasteful if this information could somehow be obtained from the pretrained vectors.

2 Answers


If what you really want is "which words appear near word X", you don't need the sort of 'dense' word vectors from word2vec/GloVe at all. Just scan your corpus and tally the co-occurrences within your window of interest.

You'll then have exact counts, not some estimation from other indirectly-related representations.

(Search for resources related to [word co-occurrence matrix] if you need more guidance on how to do such a tally.)
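For illustration, a minimal tally along those lines, assuming the corpus is available as an iterable of tokenised sentences and the window size is a free choice:

```python
# Count which words fall within +/- `window` tokens of the target word.
from collections import Counter

def cooccurrence_counts(sentences, target, window=5):
    counts = Counter()
    for tokens in sentences:
        for i, tok in enumerate(tokens):
            if tok != target:
                continue
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            # tally everything in the window except the target itself
            counts.update(tokens[lo:i] + tokens[i + 1:hi])
    return counts

# Example: the ten most frequent words within 5 tokens of "bank"
# print(cooccurrence_counts(corpus, "bank").most_common(10))
```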

gojomo
  • Thanks. I am aware of that. However, what I want is to get that information out of trained word vectors, not out of the corpus. The reason is that I do not currently have the resources to scan the entirety of common crawl, so I am trying to figure out a way to use Stanford NLP's free common crawl vectors for that purpose. – MetaInstigator Oct 28 '17 at 17:23
  • I've not seen a way to reconstruct co-occurrences from dense vectors, so this may be in 'speculative research project' territory. And to test the validity of any conjectured method of predicting co-occurrences, you'd need to evaluate against... the real co-occurrences. (If you had a full trained word2vec NN, including hidden/output weights, it's plausible but inefficient to read its outputs as a probability-distribution of co-occurrences. But AFAIK the Stanford NLP GloVe vectors, like most pretrained vector sets, don't include the full NN model.) – gojomo Oct 28 '17 at 20:06
  • Another project built on Common Crawl is mentioned at the AWS public data sets page – https://aws.amazon.com/public-datasets/common-crawl/#N-gram_and_Language_Models – and might have a more manageable dataset (a tally of all 5-grams) that's still useful for your co-occurrence purposes. – gojomo Oct 28 '17 at 20:08

While I do think that simply counting co-occurrences will work better, you can do this with many of the embedding approaches, too.

Word2vec actually builds two mappings: an encoder and a decoder.

We usually only use the encoder, and both sets of vectors should be fairly similar. But for the purpose of finding co-occurring words, the obvious approach is to encode with the encoder and then find the most similar vectors in the decoder, because those model the context.
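A minimal sketch of that idea with gensim's word2vec, assuming you train the model yourself on your own tokenised corpus (here just called `corpus`). The "decoder" is the output/context matrix, which gensim exposes as `syn1neg` when training with negative sampling; it is not included with most pretrained downloads such as the Stanford GloVe vectors, which is the limitation discussed in the comments above:

```python
import numpy as np
from gensim.models import Word2Vec

# corpus: an iterable of tokenised sentences (lists of words), assumed available
model = Word2Vec(corpus, vector_size=100, window=5, sg=1, negative=5)

def likely_context_words(word, topn=10):
    w_in = model.wv[word]       # "encoder" vector for the word
    w_out = model.syn1neg       # "decoder" (context) vectors, one row per vocab word
    # cosine similarity of the encoded word against every context vector
    scores = w_out @ w_in / (
        np.linalg.norm(w_out, axis=1) * np.linalg.norm(w_in) + 1e-9
    )
    best = np.argsort(-scores)[:topn]
    return [(model.wv.index_to_key[i], float(scores[i])) for i in best]
```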

But beware: the hyped "neural word embeddings" really focus on substitutability, i.e. which word could we substitute for this one? So you are likely to see synonyms and similar words first, then words that occur in a similar context but play a different role.

With the simple counting-based approaches you have better control over what they do: predicting the probability of words occurring together.

Has QUIT--Anony-Mousse