23

I am using Word2Vec with a dataset of roughly 11,000,000 tokens, looking to do word similarity (as part of synonym extraction for a downstream task), but I don't have a good sense of how many dimensions I should use with Word2Vec. Does anyone have a good heuristic for the range of dimensions to consider based on the number of tokens/sentences?
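For concreteness, here is a minimal sketch of the parameter I mean, assuming the gensim 4.x API (where the dimensionality is `vector_size`; older releases call it `size`) and using gensim's bundled toy corpus `common_texts` as a stand-in for my real sentences:

```python
# Minimal sketch, assuming gensim 4.x. `common_texts` is a tiny bundled toy
# corpus standing in for the real tokenized sentences.
from gensim.models import Word2Vec
from gensim.test.utils import common_texts

model = Word2Vec(
    sentences=common_texts,  # list of tokenized sentences
    vector_size=200,         # the dimensionality in question
    window=5,
    min_count=1,             # use something like 5 on a real corpus
    workers=4,
    epochs=5,
)

# Nearest neighbours by cosine similarity, as candidate synonyms.
print(model.wv.most_similar("computer", topn=5))
```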

Vin Diesel
  • You can try dimensions in the hundreds, e.g. 100, 200, 300; these have been shown to give good results. See http://arxiv.org/pdf/1301.3781.pdf – Irshad Bhat Oct 26 '14 at 17:21
  • I wonder if the results and bounds on sphere packing are relevant here https://gilkalai.wordpress.com/2016/03/23/a-breakthrough-by-maryna-viazovska-lead-to-the-long-awaited-solutions-for-the-densest-packing-problem-in-dimensions-8-and-24/ – arivero Apr 26 '16 at 23:22
  • I’m voting to close this question because it is not about programming as defined in the [help] but about ML theory and/or methodology - please see the intro and NOTE in https://stackoverflow.com/tags/machine-learning/info – desertnaut Apr 08 '22 at 07:49

3 Answers

22

The typical interval is 100-300 dimensions. I would say you need at least 50 dimensions to achieve the lowest acceptable accuracy. If you pick a smaller number of dimensions, you will start to lose properties of high-dimensional spaces. If training time is not a big concern for your application, I would stick with 200 dimensions, as they give nice features. Peak accuracy can be obtained with 300 dimensions; beyond 300, the word features won't improve dramatically, and training will be extremely slow.

I do not know of a theoretical explanation or strict bounds for dimension selection in high-dimensional spaces (and there may not be an application-independent one), but I would refer you to Pennington et al., Figure 2a, where the x-axis shows the vector dimension and the y-axis shows the accuracy obtained. That should provide empirical justification for the argument above.
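If you want to trace that kind of accuracy-vs-dimension curve on your own data, here is a hedged sketch of the sweep, assuming gensim 4.x; `common_texts` is only a toy stand-in for your corpus, and `questions-words.txt` is the Mikolov analogy set shipped with gensim's test data:

```python
# Sketch of an accuracy-vs-dimension sweep in the spirit of Figure 2a.
# Assumes gensim 4.x; replace `corpus` with your own tokenized sentences.
from gensim.models import Word2Vec
from gensim.test.utils import common_texts, datapath

corpus = common_texts  # toy stand-in

for dim in (50, 100, 200, 300):
    model = Word2Vec(corpus, vector_size=dim, min_count=1, workers=4, epochs=5)
    # Score on the standard analogy questions bundled with gensim's test data.
    score, sections = model.wv.evaluate_word_analogies(datapath("questions-words.txt"))
    # On the toy corpus nearly every question is out-of-vocabulary, so `score`
    # may be None; on a real corpus this traces out the accuracy-vs-dimension curve.
    print(dim, score)
```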

vincentmajor
Cylonmath
  • The reference "GloVe: Global Vectors for Word Representation" is not currently accessible at the link, but it is surely reachable elsewhere on the web. – arivero Apr 26 '16 at 23:31
  • This appears to be the version of record: http://www.aclweb.org/anthology/D14-1162 And here's a Scholar search for all versions of the paper: https://scholar.google.com/scholar?cluster=15824805022753088965&hl=en&as_sdt=0,47 – Dan Hicks Jul 09 '17 at 12:18
  • Are there any 200d pretrained word2vec models? I see we have GloVe at 200d, but can we use GloVe with word2vec? – bicepjai Aug 16 '17 at 07:58
  • @Cylonmath, your saying that *if you pick a smaller number of dimensions, you will start to lose properties of high-dimensional spaces* intrigues me. Do you have any expectation of what it would look like if we go all the way down to just 2D? I am trying to explain what I see; [the last image](https://stats.stackexchange.com/questions/337083/why-word-embeddings-learned-from-word2vec-are-linearly-correlated) is a 2D embedding trained on text8. – zyxue Mar 29 '18 at 02:32
0

I think that the number of dimensions for word2vec depends on your application. Empirically, a value of about 100 tends to perform well.

Massimiliano Kraus
0

The number of dimensions affects over/underfitting. 100-300 dimensions is the commonly used range. Start with one value and check the accuracy on your test set versus your training set. The larger the dimensionality, the easier it is to overfit the training set and perform badly on the test set. Tuning this parameter is needed when you have high accuracy on the training set and low accuracy on the test set: that means the dimension size is too big, and reducing it might solve the overfitting problem of your model.
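Here is a hedged sketch of that tuning loop, assuming gensim 4.x and scikit-learn; the tiny labelled corpus, the averaged-vector document features, and the logistic-regression classifier are toy stand-ins for whatever your real downstream task uses:

```python
# Compare train vs. test accuracy of a downstream classifier as the
# embedding dimensionality grows. Toy data only; assumes gensim 4.x + sklearn.
import numpy as np
from gensim.models import Word2Vec
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Hypothetical labelled documents standing in for the real downstream task.
docs = [["good", "great", "fine"], ["bad", "awful", "poor"],
        ["great", "good", "nice"], ["poor", "bad", "terrible"]] * 10
labels = [1, 0, 1, 0] * 10

def doc_vector(model, words):
    # Average the vectors of in-vocabulary words as a simple document feature.
    vecs = [model.wv[w] for w in words if w in model.wv]
    return np.mean(vecs, axis=0) if vecs else np.zeros(model.vector_size)

for dim in (10, 50, 100, 300):
    w2v = Word2Vec(docs, vector_size=dim, min_count=1, epochs=20, seed=1)
    X = np.array([doc_vector(w2v, d) for d in docs])
    X_tr, X_te, y_tr, y_te = train_test_split(X, labels, test_size=0.25, random_state=1)
    clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    # A growing gap between these two numbers suggests the dimension is too big.
    print(dim, "train:", clf.score(X_tr, y_tr), "test:", clf.score(X_te, y_te))
```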

Ayman Salama