What is the best way to find out which words are frequent near some word X? (note: NOT which words are most similar to word X)
I have GloVe word vectors, so each vector represents a distribution of some word across different environments (each dimension is an environment). So how do I retrieve words from each of those environments? In other words, how do I retrieve words that are similar in only one of the dimensions?
I tried looking for words that are closer to X along only one dimension, ignoring the rest, but that gave me garbage words.
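For concreteness, here is a minimal sketch of the single-dimension lookup I mean. The function name `closest_along_dimension` and the toy `words`/`vectors` data are made up for illustration; with real GloVe vectors the matrix would be loaded from the pretrained file instead.

```python
import numpy as np

def closest_along_dimension(vectors, words, query, dim, n):
    """Return the n words whose value in column `dim` is closest to the query's."""
    q = vectors[words.index(query), dim]
    # Absolute gap along the chosen dimension only; all other dimensions are ignored.
    gaps = np.abs(vectors[:, dim] - q)
    order = [i for i in np.argsort(gaps) if words[i] != query][:n]
    return [words[i] for i in order]

# Toy stand-in data (hypothetical), not real GloVe vectors.
words = ["a", "b", "c", "d"]
vectors = np.array([
    [0.0, 1.0],
    [0.1, -5.0],
    [2.0, 1.1],
    [0.3, 9.0],
])

print(closest_along_dimension(vectors, words, "a", dim=0, n=2))
print(closest_along_dimension(vectors, words, "a", dim=1, n=2))
```

The result depends entirely on which dimension is picked, and nothing ties a single coordinate to the query word's overall direction in the space.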
P.S. What I do so far is find the N nearest words (by cosine similarity) to word X, and then apply K-means clustering to those words. It works pretty well, but I am concerned that the N nearest words are not necessarily the words that appear NEAR word X, but rather words that appear IN SIMILAR ENVIRONMENTS to word X.
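Roughly, the nearest-neighbour-plus-clustering step looks like the sketch below. It uses only NumPy, with a small hand-rolled K-means so that it is self-contained; the helper names, the random stand-in vectors, and the choices of N = 20 and K = 3 are assumptions for illustration, not my exact setup.

```python
import numpy as np

def nearest_by_cosine(vectors, words, query, n):
    """Return the n words most cosine-similar to `query`, plus their unit vectors."""
    unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    sims = unit @ unit[words.index(query)]
    order = [i for i in np.argsort(-sims) if words[i] != query][:n]
    return [words[i] for i in order], unit[order]

def kmeans(points, k, iters=50, seed=0):
    """Plain Lloyd's algorithm; returns one cluster label per point."""
    rng = np.random.default_rng(seed)
    centers = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(iters):
        # Distance from every point to every center, then assign to the closest.
        dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):  # leave a center in place if its cluster is empty
                centers[j] = points[labels == j].mean(axis=0)
    return labels

# Random stand-in vectors (hypothetical); real use would load pretrained GloVe vectors.
rng = np.random.default_rng(0)
words = [f"w{i}" for i in range(200)]
vectors = rng.normal(size=(200, 50))

neighbours, neighbour_vecs = nearest_by_cosine(vectors, words, "w0", n=20)
labels = kmeans(neighbour_vecs, k=3)
for word, label in zip(neighbours, labels):
    print(label, word)
```

Note that this only groups words that are similar to X; it never looks at co-occurrence, which is exactly the limitation I am worried about.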
EDIT: Clarification: simply collecting n-gram counts will not suffice, since I am looking for a way to do this with only the vectors, that is, without access to the corpus itself. The reason is that some freely available pretrained vectors were trained on terabytes of data. Storing the full n-gram counts for Common Crawl, for example, would be very wasteful if this information could somehow be obtained from the pretrained vectors.