
I’m working on word embeddings and I’m a little confused about the number of dimensions of a word vector. Take word2vec as an example: why should we use, say, 100 hidden neurons in the hidden layer? Does this number have any meaning or logic behind it, or, if it is arbitrary, why not 300? Or 10? Why not more or less? As we all know, the simplest way to display vectors is in a 2-dimensional space (only X and Y), so why use more dimensions? I have read some resources about it; in one example they choose 100 dimensions, in others they choose other numbers like 150, 200, 80, etc.

I know that the larger the number, the bigger the space for capturing relations between words, but couldn’t we display those relations in a 2-dimensional vector space (only X and Y)? Why do we need a bigger space? Each word is represented by a vector, so why use a high-dimensional space when we can display vectors in 2 or 3 dimensions? And isn’t it simpler to use similarity techniques like cosine to find similarities in 2 or 3 dimensions rather than 100 (from a computation-time viewpoint)?

Shayan Zamani

1 Answer


Well, if just displaying the vectors is your end goal, then 2- or 3-dimensional vectors would indeed work best.

In NLP we often have well-defined tasks such as tagging, parsing, and capturing meaning. For these purposes, higher-dimensional vectors will almost always perform better than 2-d or 3-d vectors, because they have more degrees of freedom to capture the relationships you are after and can therefore carry richer information.
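The dimensionality is just a hyperparameter you choose when training. Here is a minimal sketch using gensim’s Word2Vec (assuming gensim 4.x, where the parameter is called `vector_size`; older versions call it `size`). The toy corpus is made up purely for illustration:

```python
# Minimal sketch: the embedding dimensionality is just a hyperparameter.
# Assumes gensim 4.x is installed; the toy corpus is made up for illustration.
from gensim.models import Word2Vec

corpus = [
    ["king", "rules", "the", "kingdom"],
    ["queen", "rules", "the", "kingdom"],
    ["man", "walks", "in", "the", "city"],
    ["woman", "walks", "in", "the", "city"],
]

# Train two models that differ only in embedding dimensionality.
model_10d = Word2Vec(corpus, vector_size=10, window=2, min_count=1, seed=1)
model_100d = Word2Vec(corpus, vector_size=100, window=2, min_count=1, seed=1)

print(model_10d.wv["king"].shape)   # (10,)
print(model_100d.wv["king"].shape)  # (100,)

# Both models expose the same similarity API; only the capacity differs.
print(model_100d.wv.similarity("king", "queen"))
```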

And isn’t it simpler to use similarity techniques like cosine to find similarities in 2 or 3 dimensions rather than 100 (from a computation-time viewpoint)?

No. That is like saying that adding 2 numbers is simpler than adding 100 numbers. The method (cosine distance) is exactly the same; as the sketch below shows, only the length of the vectors changes.
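A plain NumPy sketch (my own illustration, not part of the original answer) to make this concrete — the cosine formula is identical whether the vectors have 3 or 300 components:

```python
# Cosine similarity is the same formula at any dimensionality;
# only the length of the vectors fed into the dot product changes.
import numpy as np

def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

rng = np.random.default_rng(0)

a3, b3 = rng.standard_normal(3), rng.standard_normal(3)
a300, b300 = rng.standard_normal(300), rng.standard_normal(300)

print(cosine_similarity(a3, b3))      # 3-d vectors
print(cosine_similarity(a300, b300))  # 300-d vectors, same code path
```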

aerin
  • I wanted to upvote, but isn’t the 2nd half of your answer wrong? Running cosine distance on two vectors of size 300 requires about 100 times more CPU ops than on two vectors of size 3. So, just like adding 300 numbers instead of 3, it will take 100 times more effort. (The OP was asking about computation time, so by "simpler" he meant "quicker".) – Darren Cook Sep 02 '17 at 10:01
  • 1
    Haha.. then why do we comfortably use Big O notation? According to your logic, O(100N) is 100 times more effort than O(3N) :-) – aerin Sep 02 '17 at 16:44
  • To have enough room to learn the relationships between words, 200-400 dimensions are a necessary evil. 3 dimensions just won't be able to do it. – aerin Sep 02 '17 at 16:51
  • A single cosine similarity calculation is O(d), where d is the number of dimensions. Finding the closest word in a set of N words, in d-dimensional space, is O(Nd), which you can _approximate_ as O(N) when N is much larger than d. But the point I was making is that Shayan didn't ask about computational complexity; he asked about computation _time_. When d is 300 it takes 100 times more seconds than when d is 3. (On a single core, of course!) – Darren Cook Sep 02 '17 at 19:28
  • 1
    A single cosine similarity calculation is O(1), not O(d). The dimension is constant, at most 1000. If you increase it to more than 500, the performance deteriorates. Plus, the point of his question is not about the computation time. Read his question again. He's asking the reason why we aren't using smaller dimensions. – aerin Sep 03 '17 at 01:28
  • https://en.wikipedia.org/wiki/Cosine_similarity https://en.wikipedia.org/wiki/Big_O_notation (Agreed about the point of his question, and agree with the first half of your answer.) – Darren Cook Sep 03 '17 at 09:32