
I would like to ask more about Word2Vec:

I am currently trying to build a program that computes the embedding vectors for a sentence. At the same time, I am also building a feature-extraction step using scikit-learn to extract lemma 0, lemma 1, and lemma 2 from the sentence.

From my understanding:

1) Feature extraction: lemma 0, lemma 1, lemma 2
2) Word embedding: vectors are embedded for each character (this can be achieved by using gensim Word2Vec, which I have tried)

More explanation:

Sentence = "I have a pen". Word = token of the sentence, for example, "have"

1) Feature extraction

"I have a pen" --> lemma 0:I, lemma_1: have, lemma_2:a.......lemma 0:have, lemma_1: a, lemma_2:pen and so on.. Then when try to extract the feature by using one_hot then will produce:

[[0,0,1],
[1,0,0],
[0,1,0]]
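
For reference, a rough sketch of what I mean, using scikit-learn's DictVectorizer (the lemma_* feature names and values are just illustrative, not my actual code):

from sklearn.feature_extraction import DictVectorizer

# one dict of lemma features per token window (dummy values, for illustration only)
windows = [
    {"lemma_0": "i",    "lemma_1": "have", "lemma_2": "a"},
    {"lemma_0": "have", "lemma_1": "a",    "lemma_2": "pen"},
]

vectorizer = DictVectorizer(sparse=False)
X = vectorizer.fit_transform(windows)   # one-hot encodes each string-valued feature

print(vectorizer.get_feature_names_out())
print(X)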

2) Word embedding(Word2vec)

"I have a pen" ---> "I", "have", "a", "pen"(tokenized) then word2vec from gensim will produced matrices for example if using window_size = 2 produced:

[[0.31235,0.31345],
[0.31235,0.31345],
[0.31235,0.31345],
[0.31235,0.31345],
[0.31235,0.31345]
]

The floating-point and integer numbers are for explanation purposes only, and the actual values will vary depending on the sentence. These are just dummy data used to illustrate.

Questions:

1) Is my understanding of Word2Vec correct? If yes, what is the difference between feature extraction and Word2Vec?
2) I am curious whether I can use Word2Vec to get the feature-extraction embedding too, since from my understanding Word2Vec only finds an embedding for each word and not for the features.

Hopefully someone can help me with this.

JJson

1 Answer


It's not completely clear what you're asking, as you seem to have many concepts mixed up together. (Word2Vec gives vectors per word, not per character; word-embeddings are a kind of feature extraction on words, rather than an alternative to 'feature extraction'; etc. So: I doubt your understanding is yet correct.)

"Feature extraction" is a very general term, meaning any and all ways of taking your original data (such as a sentence) and creating a numerical representation that's good for other kinds of calculation or downstream machine-learning.

One simple way to turn a corpus of sentences into numerical data is to use a "one-hot" encoding of which words appear in each sentence. For example, if you have the two sentences...

['A', 'pen', 'will', 'need', 'ink']
['I', 'have', 'a', 'pen']

...then you have 7 unique case-flattened words...

['a', 'pen', 'will', 'need', 'ink', 'i', 'have']

...and you could "one-hot" the two sentences as a 1-or-0 for each word they contain, and thus get the 7-dimensional vectors:

 [1, 1, 1, 1, 1, 0, 0]  # A pen will need ink
 [1, 1, 0, 0, 0, 1, 1]  # I have a pen

Even with this simple encoding, you can now compare sentences mathematically: a euclidean-distance or cosine-distance calculation between those two vectors will give you a summary distance number, and sentences with no shared words will have a high 'distance', and those with many shared words will have a small 'distance'.
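
A minimal sketch of that encoding and distance calculation, using plain Python and scipy (the one_hot helper here is just illustrative):

from scipy.spatial.distance import cosine, euclidean

vocab = ['a', 'pen', 'will', 'need', 'ink', 'i', 'have']

def one_hot(tokens):
    # 1 if the vocabulary word appears among the (lowercased) tokens, else 0
    lowered = [t.lower() for t in tokens]
    return [1 if word in lowered else 0 for word in vocab]

v1 = one_hot(['A', 'pen', 'will', 'need', 'ink'])   # [1, 1, 1, 1, 1, 0, 0]
v2 = one_hot(['I', 'have', 'a', 'pen'])             # [1, 1, 0, 0, 0, 1, 1]

print(euclidean(v1, v2))  # straight-line distance between the two sentence vectors
print(cosine(v1, v2))     # cosine distance: 0.0 means same direction, higher means less similar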

Other, very similar alternative feature-encodings of these sentences might involve counts of each word (if a word appears more than once, a number higher than 1 could appear), or weighted counts (where words get an extra significance factor by some measure, such as the common "TF/IDF" calculation, and thus values scaled anywhere from 0.0 to values higher than 1.0), as sketched below.
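
For instance, a count-based or TF-IDF-weighted encoding of the same two sentences might be sketched with scikit-learn (my own illustration; the token_pattern override is only there to keep one-letter words like "a" and "I"):

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

sentences = ["A pen will need ink", "I have a pen"]

# raw per-word counts (the default tokenizer drops 1-letter tokens, so override token_pattern)
counts = CountVectorizer(token_pattern=r"(?u)\b\w+\b").fit_transform(sentences)

# TF-IDF: words that appear in many sentences get down-weighted
tfidf = TfidfVectorizer(token_pattern=r"(?u)\b\w+\b").fit_transform(sentences)

print(counts.toarray())
print(tfidf.toarray())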

Note that you can't encode a single sentence as a vector that's just as wide as its own words, such as "I have a pen" into a 4-dimensional [1, 1, 1, 1] vector. That then isn't comparable to any other sentence. They all need to be converted to the same-dimensional-size vector, and in "one hot" (or other simple "bag of words") encodings, that vector is of dimensionality equal to the total vocabulary known among all sentences.

Word2Vec is a way to turn individual words into "dense" embeddings with fewer dimensions but many non-zero floating-point values in those dimensions. This is instead of sparse embeddings, which have many dimensions that are mostly zero. The 7-dimensional sparse embedding of 'pen' alone from above would be:

[0, 1, 0, 0, 0, 0, 0]  # 'pen'

If you trained a 2-dimensional Word2Vec model, it might instead have a dense embedding like:

[0.236, -0.711]  # 'pen'

All the 7 words would have their own 2-dimensional dense embeddings. For example (all values made up):

[-0.101, 0.271]   # 'a'
[0.236, -0.711]   # 'pen'
[0.302, 0.293]    # 'will'
[0.672, -0.026]   # 'need'
[-0.198, -0.203]  # 'ink'
[0.734, -0.345]   # 'i'
[0.288, -0.549]   # 'have'
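
A minimal sketch of training such dense vectors with gensim (the tiny corpus and 2-dimensional size are only illustrative; real training needs far more data and dimensions, and older gensim versions call the vector_size parameter size, as in the comments below):

from gensim.models import Word2Vec

# each training "sentence" is a list of tokens
corpus = [
    ['a', 'pen', 'will', 'need', 'ink'],
    ['i', 'have', 'a', 'pen'],
]

model = Word2Vec(corpus, vector_size=2, window=2, min_count=1, seed=42)

print(model.wv['pen'])                        # a 2-dimensional dense vector
print(model.wv.most_similar('pen', topn=2))   # nearest words by cosine similarity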

If you have Word2Vec vectors, then one alternative simple way to make a vector for a longer text, like a sentence, is to average together all the word-vectors for the words in the sentence. So, instead of a 7-dimensional sparse vector for the sentence, like:

[1, 1, 0, 0, 0, 1, 1]  # I have a pen

...you'd get a single 2-dimensional dense vector like:

[ 0.28925, -0.3335 ]  # I have a pen

And again different sentences may be usefully comparable to each other based on these dense-embedding features, by distance. Or these might work well as training data for a downstream machine-learning process.
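
A sketch of that averaging, assuming a trained gensim model like the one above and a simple lowercased split() for tokenization:

import numpy as np

def sentence_vector(sentence, model):
    # average the dense word-vectors of every in-vocabulary token in the sentence
    tokens = [t for t in sentence.lower().split() if t in model.wv]
    return np.mean([model.wv[t] for t in tokens], axis=0)

print(sentence_vector("I have a pen", model))  # one dense vector, same size as the word-vectors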

So, this is a form of "feature extraction" that uses Word2Vec instead of simple word-counts. There are many other more sophisticated ways to turn text into vectors; they could all count as kinds of "feature extraction".

Which works best for your needs will depend on your data and ultimate goals. Often the most-simple techniques work best, especially once you have a lot of data. But there are few absolute certainties, and you often need to just try many alternatives, and test how well they do in some quantitative, repeatable scoring evaluation, to find which is best for your project.

gojomo
  • Thank you so much for the explanations on Word2Vec. Now I understand it, so the vectors for each word are averaged when finding a sentence vector, right? One more thing that I would like to confirm and understand: when using word2vec, the text is converted to a sparse embedding, then the 2-dimensional and 3-dimensional vectors (depending on the adjustment we make based on the window_size) – JJson Sep 19 '18 at 01:13
  • such as [-0.101, 0.271] and [-0.101, 0.271, 0.288] respectively, are used to represent only a word, and the individual numbers within these lists of floats carry no meaning when checked individually, right? For example, for [-0.101, 0.271] # a ... the "-0.101" is a sparse space within the dimension and it has no specific meaning if it is not together with "0.271"? And in order to find the most common word, it will be compared by using cosine distance to find the next common word? Am I getting it right? Thank you, and sorry for the trouble. I started last week, so I need time to learn. – JJson Sep 19 '18 at 01:14
  • When I try to use gensim to train a word2vec model, for example model = gensim.models.word2vec.Word2Vec(sentence, size=25, window=5, min_count=1) with sentence = [["I"], ["have"], ["a"], ["pen"]], the output vector for ["a"] has a dimensionality of 25, like I set it to be, but in a one-hot vector the dimensionality means the number of unique words in the sentences. Somehow, here I became very confused: why can I set it to be 25 dimensions, and it really outputs 25 dimensions? Then which word does each column of this dimension represent? In one-hot, each column represents a unique word. – JJson Sep 19 '18 at 06:42
  • Averaging the word-vectors is one possible way to make a vector for a text. There are many other ways. – gojomo Sep 19 '18 at 13:56
  • For Word2Vec, depending on the mode, there's an internal stage that's a little bit like a "sparse" encoding of the input word or input context. But it's not literally that. – gojomo Sep 19 '18 at 13:58
  • They're not "2-dimensional" or "3-dimensional" vectors, but whatever `size` you've specified – usually 100 dimensions or more, for real datasets. The vector dimensionality has nothing to do with the `window` size. (And I don't recommend doing much experimentation with tiny toy-sized datasets of just a few or few dozen sentences – that's so unlike what happens with real data that it can introduce its own problems.) – gojomo Sep 19 '18 at 14:00
  • No, none of the example coordinates in my `[-0.101, 0.271]` vector represent a 'sparse space'. Those are coordinates in a dense embedding, with fewer dimensions and few (essentially none) that are 0.0. You are correct that within such a dense embedding, the individual coordinates like `-0.101` don't have any exactly-interpretable meaning. They're only meaningful in combination with all other coordinates, as a rough direction in the N-dimensional space, and in comparison to other vectors. – gojomo Sep 19 '18 at 14:03
  • You can't ask the `Word2Vec` model to report "one-hot" vectors, for either words or runs-of-text, because that's not its purpose and, as noted above, it doesn't literally create a one-hot vector at any point of its training. (There's a stage that's a *little* like one-hot encoding of words, but it really just looks up the dense vector for a word at a certain position in an array the size of the vocabulary. It doesn't even instantiate the one-hot vector.) – gojomo Sep 19 '18 at 14:07
  • Thank you so much for clarifying and explaining everything to me. I understand the basics of Word2Vec now. I will keep trying bigger datasets to get to know more about Word2Vec. I am really glad that you were willing to spend your time explaining it to me. This really saves a lot of my time searching, reading papers, and doing implementations to understand it. Thank you again. – JJson Sep 20 '18 at 00:09
  • A short question: does the size of the dimensions matter? Since the dimensions themselves are the coordinates of each word, used to find the distance to the next word, should we just minimize the dimensionality so that it stays small and contributes to nearer distances between words? A bigger dimensionality is only good when we have thousands or billions of training examples, since the vocabulary will be huge and a sparse dimension will make finding nearer words feasible. – JJson Sep 20 '18 at 01:27
  • Yes, the size of the dimensions matters. Common values for word-vector dimensionality range from 100 to 1000, with 300-400 being especially common. But there's no one best size: good projects will test multiple values, given their data and goals. And you need a lot of data to create good larger vectors. (Google's popular 'GoogleNews'-trained 300-dimensional vectors were trained on billions of words of text.) – gojomo Sep 20 '18 at 07:02
  • Thank you again, sir. Alright, I will try to download some large datasets and do more tests to understand it. – JJson Sep 20 '18 at 07:22
  • Hi sir, if you could spare some time to share more knowledge about my currently built doc2vec code, that would be good too. Is it alright for you to check whether my understanding is correct or wrong? https://stackoverflow.com/questions/52436762/doc2vec-output-data-for-only-a-single-document-and-not-two-documents-vectors – JJson Sep 21 '18 at 04:49