
While classifying and clustering documents written in natural language, I came up with a question ...

Since word2vec, GloVe, etc. vectorize words in a distributed space, I wonder if there is any recommended or commonly used method for document vectorization USING word vectors.

For example,

Document1: "If you chase two rabbits, you will lose them both."

can be vectorized as,

[0.1425, 0.2718, 0.8187, .... , 0.1011]

I know about the technique known as doc2vec, which gives a document an n-dimensional vector just like word2vec. But that is a 1 x n representation, and I have been experimenting to find the limits of using doc2vec.

So I want to know how other people apply word vectors in applications that need a fixed-size representation.

Just stacking the vectors of m words forms an m x n matrix. In this case the dimensions are not uniform, since m depends on the number of words in the document:

If: [0.1018, ... , 0.8717]

you: [0.5182, ... , 0.8981]

..: [...]

m-th word: [...]

And this form is not favorable for running some machine learning algorithms, such as a CNN. What are the suggested methods to produce document vectors of a fixed size using word vectors?
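To make the shape issue concrete, here is a minimal sketch (with a hypothetical word_vecs lookup standing in for pretrained embeddings) showing how the first dimension of the stacked representation changes with document length:

    import numpy as np

    # Hypothetical pretrained lookup: each word maps to an n=100 dimensional vector.
    vocab = "if you chase two rabbits will lose them both".split()
    word_vecs = {w: np.random.rand(100) for w in vocab}

    def stack_vectors(doc):
        # One row per in-vocabulary word -> shape (m, 100), where m varies per document.
        rows = [word_vecs[w] for w in doc.lower().split() if w in word_vecs]
        return np.vstack(rows)

    print(stack_vectors("if you chase two rabbits").shape)  # (5, 100)
    print(stack_vectors("you will lose them").shape)        # (4, 100) -- different m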

It would be great if pointers to papers were provided as well.

Thanks!

Isaac Sim

1 Answer


The simplest approach to get a fixed-size vector from a text, when all you have is word-vectors, is to average all the word-vectors together. (The vectors could be weighted, but if they haven't been unit-length-normalized, their raw magnitudes from training are somewhat of an indicator of their strength-of-single-meaning: polysemous/ambiguous words tend to have vectors with smaller magnitudes.) It works OK for many purposes.
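For example, a minimal sketch of that averaging, assuming gensim 4.x and a placeholder pretrained word2vec-format file vectors.bin:

    import numpy as np
    from gensim.models import KeyedVectors

    # Assumption: pretrained word2vec/GloVe vectors in word2vec format at this placeholder path.
    kv = KeyedVectors.load_word2vec_format("vectors.bin", binary=True)

    def average_vector(text, kv):
        # Naive whitespace tokenization; average all in-vocabulary word-vectors.
        words = [w for w in text.lower().split() if w in kv]
        if not words:
            return np.zeros(kv.vector_size)
        return np.mean([kv[w] for w in words], axis=0)

    doc_vec = average_vector("If you chase two rabbits, you will lose them both.", kv)
    print(doc_vec.shape)  # (n,) -- one fixed-size vector regardless of document length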

Word vectors can be specifically trained to be better at composing like this, if the training texts are already associated with known classes. Facebook's FastText in its 'classification' mode does this: the word-vectors are optimized as much for predicting the output classes of the texts they appear in as for predicting their context-window neighbors (classic word2vec), or more so.
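As a rough sketch with the fasttext Python package (assuming a hypothetical training file train.txt whose lines look like "__label__classname some tokenized text ..."):

    import fasttext

    # Assumption: train.txt contains one labeled example per line, e.g.
    # "__label__proverb if you chase two rabbits you will lose them both"
    model = fasttext.train_supervised(input="train.txt", dim=100, epoch=25)

    # The word-vectors are tuned for the class labels, so their composition into a
    # single fixed-size text vector tends to be more useful for classification.
    vec = model.get_sentence_vector("if you chase two rabbits you will lose them both")
    print(vec.shape)  # (100,)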

The 'Paragraph Vector' technique, often called 'doc2vec', gives every training text a sort-of floating pseudoword that contributes to every prediction, and thus winds up with a word-vector-like position that may represent that full text, rather than the individual words/contexts.
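A minimal sketch of that idea with gensim's Doc2Vec (gensim 4.x API, toy two-document corpus):

    from gensim.models.doc2vec import Doc2Vec, TaggedDocument

    # Each training text gets a tag -- the "floating pseudoword" trained
    # alongside the word predictions.
    corpus = [
        TaggedDocument("if you chase two rabbits you will lose them both".split(), tags=["doc1"]),
        TaggedDocument("a bird in the hand is worth two in the bush".split(), tags=["doc2"]),
    ]

    model = Doc2Vec(corpus, vector_size=100, min_count=1, epochs=40)

    print(model.dv["doc1"].shape)                                     # (100,) for a training text
    print(model.infer_vector("look before you leap".split()).shape)   # (100,) for a new text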

There are many further variants, including some based on deeper predictive networks (e.g. 'Skip-thought Vectors'), or slightly different prediction targets (e.g. neighboring sentences in 'FastSent'), or other generalizations that can even include a mixture of symbolic and numeric inputs/targets during training (an option in Facebook's StarSpace, which explores other entity-vectorization possibilities related to word-vectors and FastText-like classification needs).

If you don't need to collapse a text to fixed-size vectors, but just compare texts, there are also techniques like "Word Mover's Distance" which take the "bag of word-vectors" for one text, and another, and give a similarity score.
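A hedged sketch of that with gensim's wmdistance() (same placeholder vectors.bin; this call also needs an optimal-transport helper package such as pyemd or POT, depending on gensim version):

    from gensim.models import KeyedVectors

    # Assumption: same placeholder pretrained vectors as above.
    kv = KeyedVectors.load_word2vec_format("vectors.bin", binary=True)

    doc1 = "if you chase two rabbits you will lose them both".split()
    doc2 = "trying to do two things at once often means failing at both".split()

    # No fixed-size document vector is built; the two bags of word-vectors are
    # compared directly, yielding a distance (lower = more similar).
    print(kv.wmdistance(doc1, doc2))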

gojomo
  • Thanks for the full explanation of natural-language vectorization. But your answer doesn't address the fundamental problem in my question. I am sure that methods like 'doc2vec' or 'averaging word vectors', or even others, are very useful, as you mentioned. But they compress the document into 1 x n dimensions. For my case, I think I need to represent the document with word- and character-level vectors together as inputs to a machine learning algorithm. And I wondered if there were any common methods for approaching this problem; for example, – Isaac Sim May 09 '18 at 09:36
  • for example, a stack of word vectors can represent the document the words belong to. The only problem is that the dimension of that document representation will depend on the number of words. ..... Do you now understand what I want to know? Or do you need further explanation? – Isaac Sim May 09 '18 at 09:40
  • If you want to compare docs of different word-lengths, that tends to require, somewhere, coercing them both to a same-length representation. In the 'Skip-thought' example, a deep convolutional recurrent network takes variable-length input - each word individually - but then 'encodes' that to a fixed-length (1 x n) vector. And if you back up from the 'Word Mover's Distance' approach, it's calculating its distances based on (1 x V) vectors (V=vocab size), using added data from (V x n) (n=word-vector-dim) word-vectors. But yes, de facto 'compression' to a common-sized representation is typical. – gojomo May 10 '18 at 00:36
  • Thank you very much. That is what I had in mind as the solution, and you actually mentioned it... Thank you – Isaac Sim May 11 '18 at 03:18