
In the word2vec paper, they use a linear activation function. My reasoning is that they may provide enough training data for learning the word embeddings that a non-linear activation function is not necessary; am I correct?

Also, if we use a non-linear activation function in the hidden layer, then I think the results should be better. So why does Google use a linear activation function in word2vec?

Azad

1 Answer


It seems to me that most of your confusion comes from thinking that their model is entirely linear. That's not true, because there is effectively always a softmax layer at the end. What is linear is everything that comes before it, and this is what differs from the NNLM.

Remember that the main idea of all word-representation methods is to predict neighboring words, i.e. to maximize the total conditional probability of the context given the center word (or vice versa):

$$\frac{1}{T}\sum_{t=1}^{T}\;\sum_{-c \le j \le c,\; j \ne 0} \log p(w_{t+j} \mid w_t)$$

So the objective function is bound to end with a final softmax layer (or the like). I encourage you to read this post for more details; it's pretty short and well-written.
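To make that concrete, here is a minimal NumPy sketch of a skip-gram-style forward pass (not the actual word2vec implementation; the vocabulary size, dimensions and variable names are made up for illustration). The hidden layer is just a linear embedding lookup, and the only non-linearity is the softmax over the output scores:

```python
import numpy as np

# Hypothetical sizes, for illustration only
vocab_size, embed_dim = 10_000, 300

rng = np.random.default_rng(0)
W_in = rng.normal(scale=0.01, size=(vocab_size, embed_dim))   # input (center-word) embeddings
W_out = rng.normal(scale=0.01, size=(vocab_size, embed_dim))  # output (context-word) embeddings

def context_probabilities(center_word_id):
    """P(context word | center word) for every word in the vocabulary."""
    h = W_in[center_word_id]              # hidden layer: a purely linear lookup, no activation
    scores = W_out @ h                    # linear scores for every candidate context word
    scores -= scores.max()                # shift for numerical stability
    exp_scores = np.exp(scores)
    return exp_scores / exp_scores.sum()  # softmax: the only non-linearity in the model

probs = context_probabilities(center_word_id=42)
print(probs.shape, probs.sum())           # (10000,) and ~1.0
```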

You are right that the more non-linearity a neural network has, the more flexibility it gets and thus the better it approximates the target distribution. In this case, they reason that the additional flexibility doesn't pay off: they get a very good result much faster, which allows them to scale the method to huge corpora, which in turn gives better results.

Side note: linear regression doesn't require iterative training at all in order to find a solution; there is a closed-form formula (though there are technical difficulties with large matrices).
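To illustrate that side note, here is a minimal NumPy sketch of the closed-form (normal-equation) solution for ordinary least squares, with made-up data; in practice you would typically use np.linalg.lstsq, which is more stable for large or ill-conditioned matrices:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))                  # design matrix: 100 samples, 3 features
true_w = np.array([2.0, -1.0, 0.5])
y = X @ true_w + rng.normal(scale=0.1, size=100)

# Closed-form least-squares solution: w = (X^T X)^{-1} X^T y
w_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(w_hat)                                    # close to [2.0, -1.0, 0.5]
```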

Maxim