
I'm training word2vec from scratch on a 34 GB pre-processed MS MARCO corpus (the original corpus is 22 GB; the pre-processed version is SentencePiece-tokenized, which is why it is larger). I'm training my word2vec model with the following code:

from gensim.models import Word2Vec

class Corpus():
    """Stream sentences from the corpus files, one whitespace-tokenized line at a time."""
    def __init__(self):
        self.files = [
            "sp_cor1.txt",
            "sp_cor2.txt",
            "sp_cor3.txt",
            "sp_cor4.txt",
            "sp_cor5.txt",
            "sp_cor6.txt",
            "sp_cor7.txt",
            "sp_cor8.txt"
        ]

    def __iter__(self):
        # lazily yield one sentence (a list of tokens) at a time, so the
        # whole corpus never has to fit in RAM
        for fname in self.files:
            with open(fname) as f:
                for line in f:
                    yield line.split()

sentences = Corpus()

model = Word2Vec(sentences, size=300, window=5, min_count=1, workers=8, sg=1, hs=1, negative=10)
model.save("word2vec.model")

My model has now been running for more than 30 hours. This seems doubtful, since on my i5 laptop with 8 cores, all 8 cores are at 100% at every moment. In addition, my program seems to have read more than 100 GB of data from disk so far. I don't know whether anything is wrong here, but the main reason behind my doubt about the training is this 100 GB read from disk: the whole corpus is only 34 GB, so why has my code read 100 GB of data from disk? Does anyone know how long it should take to train word2vec on 34 GB of text with 8 i5 CPU cores running in parallel? Thank you. For more information, I'm also attaching a screenshot of the process from my system monitor.


I want to know why my model has read 112 GB from disk, even though my corpus is only 34 GB in total. Will my training ever finish? I'm also a bit worried about the health of my laptop, since it has been running constantly at peak capacity for the last 30 hours and is really hot now. Should I add any additional parameter to Word2Vec for quicker training without much performance loss?


1 Answer


Completing a model requires one pass over all the data to discover the vocabulary, then multiple passes, with a default of 5, to perform vector training. So, you should expect to see about 6x your data size in disk-reads, just from the model training.
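As a rough back-of-the-envelope check (a sketch only, assuming gensim's default of 5 training epochs and no swap-related re-reads):

corpus_gb = 34
passes = 1 + 5                 # one vocabulary-survey pass + five training epochs (gensim default)
print(corpus_gb * passes)      # ~204 GB of total disk reads expected by the end

# the ~112 GB observed so far is consistent with the vocabulary pass
# plus a bit over 2 of the 5 training epochs
print((112 - corpus_gb) / corpus_gb)   # ~2.3 epochs completed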

(If your machine winds up needing to use virtual-memory swapping during the process, there could be more disk activity – but you absolutely do not want that to happen, as the random-access pattern of word2vec training is nearly a worst-case for virtual memory usage, which will slow training immensely.)

If you'd like to understand the code's progress, and be able to estimate its completion time, you should enable Python logging to at least the INFO level. Various steps of the process will report interim results (such as the discovered and surviving vocabulary size) and estimated progress. You can often tell if something is going wrong before the end of a run by studying the logging outputs for sensible values, and once the 'training' phase has begun the completion time will be a simple projection from the training completed so far.
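For example, enabling INFO-level logging takes just a couple of lines before you build the model (the format string here is only one common choice, not something gensim requires):

import logging

# gensim reports vocabulary-survey results and periodic training progress
# (including a projected finish time) at the INFO level
logging.basicConfig(
    format="%(asctime)s : %(levelname)s : %(message)s",
    level=logging.INFO,
)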

I believe most laptops should throttle their own CPU if it's becoming so hot as to become unsafe or risk extreme wear on the CPU/components, but whether yours does, I can't say, and definitely make sure its fans work & vents are unobstructed.

I'd suggest you choose some small random subset of your data – maybe 1GB? – to be able to run all your steps to completion, becoming familiar with the Word2Vec logging output, resource usage, and results, and tinkering with settings to observe changes, before trying to run on your full dataset, which might require days of training time.
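One way to carve out such a subset (a sketch only: the ~3% keep-rate and the output filename are arbitrary illustrative choices):

import random

random.seed(0)  # reproducible subset

source_files = ["sp_cor%d.txt" % i for i in range(1, 9)]

with open("sp_cor_sample.txt", "w") as out:
    for fname in source_files:
        with open(fname) as f:
            for line in f:
                if random.random() < 0.03:   # keep roughly 3% of lines (~1GB of 34GB)
                    out.write(line)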

Some of your shown parameters aren't optimal for speedy training. In particular:

  • min_count=1 retains every word seen in the corpus-survey, including those with only a single occurrence. This results in a much, much larger model - potentially risking a model that doesn't fit into RAM, forcing disastrous swapping. But also, words with just a few usage examples can't possibly get good word vectors, as the process requires seeing many subtly-varied alternate uses. Still, via typical 'Zipfian' word-frequencies, the number of such words with just a few uses may be very large in total, so retaining all those words takes a lot of training time/effort, and even serves a bit like 'noise' making the training of other words, with plenty of usage examples, less effective. So for model size, training speed, and quality of remaining vectors, a larger min_count is desirable. The default of min_count=5 is better for most projects than min_count=1 – this is a parameter that should only really be changed if you're sure you know the effects. And, when you have plentiful data – as with your 34GB – the min_count can go much higher to keep the model size manageable.

  • hs=1 should only be enabled if you want to use the 'hierarchical-softmax' training mode instead of 'negative-sampling' – and in that case, negative=0 should also be set to disable 'negative-sampling'. You probably don't want to use hierarchical-softmax: it's not the default for a reason, and it doesn't scale as well to larger datasets. But here you've enabled it in addition to negative-sampling, likely more-than-doubling the required training time.

  • Did you choose negative=10 because you had problems with the default negative=5? Because this non-default choice, again, would slow training noticeably. (But also, again, a non-default choice here would be more common with smaller datasets, while larger datasets like yours are more likely to experiment with a smaller negative value.) A revised call reflecting these points is sketched just after this list.
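Putting those three points together, a revised call might look roughly like the following (a sketch only: it keeps your sg/size/window/workers choices, and the min_count value shown is just an illustrative guess you'd tune against the vocabulary sizes reported in the logs):

model = Word2Vec(
    sentences,
    size=300,
    window=5,
    min_count=50,   # illustrative: something much higher than 1 is reasonable for a 34GB corpus
    workers=8,
    sg=1,
    hs=0,           # negative-sampling only (hs=0 and negative=5 are the defaults)
    negative=5,
)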

The theme of the above observations is: "only change the defaults if you've already got something working, and you have a good theory (or way of testing) how that change might help".

With a large-enough dataset, there's another default parameter to consider changing to speed up training (& often improve word-vector quality, as well): sample, which controls how-aggressively highly-frequent words (with many redundant usage-examples) may be downsampled (randomly skipped).

The default value, sample=0.001 (aka 1e-03), is very conservative. A smaller value, such as sample=1e-05, will discard many-more of the most-frequent-words' redundant usage examples, speeding overall training considerably. (And, for a corpus of your size, you could eventually experiment with even smaller, more-aggressive values.)
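To get a feel for the difference, here is a rough illustration of the per-occurrence keep-probability for a very frequent word (this assumes the subsampling formula from the original word2vec.c code, which gensim follows, and the 5% corpus-frequency figure is an invented example):

from math import sqrt

def keep_probability(word_fraction, sample):
    """Chance that any single occurrence of a word survives downsampling."""
    return (sqrt(word_fraction / sample) + 1) * (sample / word_fraction)

# a token that makes up 5% of all tokens in the corpus (hypothetical)
print(keep_probability(0.05, 1e-3))   # ~0.16  -> about 16% of its occurrences are trained on
print(keep_probability(0.05, 1e-5))   # ~0.014 -> about 1.4%, so far less time is spent on it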

Finally, to the extent all your data (for either a full run, or a subset run) can be in an already-space-delimited text file, you can use the corpus_file alternate method of specifying the corpus. Then, the Word2Vec class will use an optimized multithreaded IO approach to assign sections of the file to alternate worker threads – which, if you weren't previously seeing full saturation of all threads/CPU-cores, could increase your throughput. (I'd put this off until after trying other things, then check if your best setup still leaves some of your 8 threads often idle.)
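For instance, if the eight files were concatenated into one space-delimited text file (the combined filename below is just a placeholder), the call would look roughly like this:

# corpus_file takes a path rather than a Python iterable, letting worker
# threads read their own file segments directly (available in gensim 3.6+)
model = Word2Vec(
    corpus_file="all_sp_cor.txt",
    size=300,
    window=5,
    min_count=50,   # illustrative, as above
    workers=8,
    sg=1,
)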

gojomo
  • First of all, thank you so much for this excellent answer. I wish I could give +10 upvotes. About `min_count=1`, I'd like to clarify that I've already pre-processed the MS MARCO corpus using a SentencePiece model, which has a vocabulary size of 30000. So my corpus will also have at most 30000 different tokens. – Ruchit Patel Mar 26 '20 at 05:28
  • I really did not know anything about `hs` (hierarchical softmax), but I saw it recommended somewhere in the gensim documentation. I used negative=10 thinking that it would lead to a more robust and accurate model. Can you tell me a bit more about how negative=10 will improve the robustness of the model compared to negative=5? Also, now I don't think I should interrupt the training, because 145 GB of data has already been processed according to the system monitor. – Ruchit Patel Mar 26 '20 at 05:38
  • Also, in my case I've used both hs=1 and negative=10. But you said that when hs=1, negative should be 0. What will happen in my case? My code seems inconsistent according to your statement. And lastly, about providing `corpus_file`: my 8 cores are already working constantly at 100%, so I shouldn't worry much about that, right? – Ruchit Patel Mar 26 '20 at 05:42
  • `negative=10` will use more negative examples per positive example. It takes more time to train – perhaps about twice as long (though I don't recall exactly). It might help a bit, or might not, and if it does help it may not help as much as other options that would similarly take extra time. If it were an obviously better choice, it'd be the default – but most word2vec/similar libraries default to `negative=5`. I think some papers using smaller datasets have used much larger `negative` values, perhaps it helps a bit to "corpus stretch" in those cases, but you've already got a big corpus. – gojomo Mar 26 '20 at 06:32
  • `hs=1` means an output layer for training via the HS method is created, in addition to the output layer for the negative-sampling method. Negative sampling has a pretty easy-to-understand output layer: one node per word to predict. Hierarchical softmax instead encodes a single word as a set of output-node activations. It tends to get progressively slower with larger vocabularies. (Your 30k vocab isn't much of a worry here.) By enabling both, training essentially does a set of neg-sampling updates, *then* a set of HS updates, for each 'target' word, sharing the same 'input' word vectors. – gojomo Mar 26 '20 at 06:36
  • But generally, one mode or the other will be better. And usually, especially for larger vocabs/corpuses, negative-sampling should be used. Finally, yes, if you're sure all 8 threads and all 8 cores are saturated, there's no reason to use `corpus_file`. (But if after tweaking other settings, you see some threads/cores not-fully-saturated during training, you might get somewhat faster training with it instead. During the 1st vocab pass, neither mode will be able to saturate all cores, as that tally has some inherently single-threaded aspects.) – gojomo Mar 26 '20 at 06:38
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/210357/discussion-between-mike-patel-and-gojomo). – Ruchit Patel Mar 26 '20 at 08:00
  • I prefer to answer questions in the visible, findable answers/discussion, where it can also help others. – gojomo Mar 26 '20 at 19:24