It seems to me that most of your confusion comes from thinking that their model is entirely linear. That's not true: there is effectively always a softmax layer at the end. What is linear is everything that comes before it, and that is what differs from NNLM.
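To make that concrete, here is a minimal numpy sketch of a skip-gram-style forward pass (the dimensions and variable names are made up for illustration, not taken from the paper or any library):

```python
import numpy as np

# Illustrative skip-gram forward pass: everything is linear until the final softmax.
vocab_size, embed_dim = 10_000, 300

rng = np.random.default_rng(0)
W_in = rng.normal(scale=0.01, size=(vocab_size, embed_dim))   # input (center-word) vectors
W_out = rng.normal(scale=0.01, size=(vocab_size, embed_dim))  # output (context-word) vectors

center_word_id = 42
h = W_in[center_word_id]   # embedding lookup = linear projection of a one-hot input
scores = W_out @ h         # linear scores for every word in the vocabulary

# The only non-linearity is the softmax at the very end.
probs = np.exp(scores - scores.max())
probs /= probs.sum()       # p(context word | center word) over the whole vocabulary
```

Note there is no hidden layer with a tanh or sigmoid in between, which is exactly the difference from NNLM.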
Remember that the main idea of all word representation methods is to predict neighboring words, i.e. to maximize the total conditional probability of the context given the center word (or vice versa):
$$\frac{1}{T}\sum_{t=1}^{T}\;\sum_{-c \le j \le c,\; j \ne 0} \log p(w_{t+j} \mid w_t)$$

where $w_1, \dots, w_T$ are the words of the training corpus and $c$ is the context window size.
So the objective function is bound to end with a final softmax layer (or the like). I encourage you to read this post for more details; it's pretty short and well written.
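Spelling that out, each probability in the sum above is typically written as a softmax over inner products of the word vectors (standard skip-gram notation, with $v$ the input vectors and $v'$ the output vectors):

$$p(w_O \mid w_I) = \frac{\exp\!\left({v'_{w_O}}^{\top} v_{w_I}\right)}{\sum_{w=1}^{W} \exp\!\left({v'_{w}}^{\top} v_{w_I}\right)}$$

Everything inside the exponentials is linear in the vectors; the softmax itself supplies the only non-linearity.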
You are right that the more non-linearity a neural network has, the more flexibility it gains, and thus the better it can approximate the target distribution. In this case, they reason that the additional flexibility doesn't pay off: in the end, they get a very good result much faster, which lets them scale the method to huge corpora, which in turn gives better results.
Side note: linear regression doesn't require iterative training at all to find a solution; there is a closed-form formula (though there are technical difficulties with large matrices).
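If it helps, here is what that closed-form (normal equation) solution looks like in practice; the data below is made up purely for illustration:

```python
import numpy as np

# Ordinary least squares via the normal equation: w = (X^T X)^(-1) X^T y.
# Toy data; with large or ill-conditioned X you'd prefer np.linalg.lstsq
# or an iterative solver instead of forming X^T X explicitly.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))                  # 100 samples, 3 features
true_w = np.array([2.0, -1.0, 0.5])
y = X @ true_w + rng.normal(scale=0.1, size=100)

w = np.linalg.solve(X.T @ X, X.T @ y)          # closed-form solution, no training loop
print(w)                                       # close to [2, -1, 0.5]
```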