Let's label a few of the things you've mentioned explicitly:
- corpus-A: 6 654 940 sentences, 19 592 258 tokens (~2.9 tokens per sentence)
- corpus-B: 15 607 sentences, 927 711 tokens (~60 tokens per sentence)
I'll observe right away that the tiny average size of corpus-A sentences suggests they might not be the kind of natural-language-like runs-of-words against which `word2vec` is typically run. Such clipped sentences may not give rise to the kinds of window-sized contexts that are most typical for this kind of training. (The effective windows will be atypically small, no matter your choice of `window`. And note further that no training can happen from a sentence with a single token – it's a no-op.)
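As a quick check of how common those degenerate cases are – a sketch, assuming corpus-A is already in memory as a list of token-lists named `corpus_a`, as in the snippets below – you could tally sentence lengths:

    from collections import Counter

    # Tally sentence lengths across corpus-A (each sentence assumed to be a list of tokens)
    length_counts = Counter(len(sentence) for sentence in corpus_a)

    print(length_counts[1], "of", len(corpus_a), "sentences are single-token & contribute nothing to training")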
So, any scaling/sampling of corpus-A (with its sentences of around 3 tokens) is not, at the end of the process, going to be much like corpus-B (with its more typical sentences of dozens to possibly hundreds of tokens). They won't really be alike, except in whichever single measurement you choose to target.
If in fact you have enough memory to operate on corpus-A completely in RAM, then choosing a random subset of 15,607 sentences – to match the sentence count of corpus-B – is very simple using standard Python functions:
    import random
    corpus_a_subset = random.sample(corpus_a, len(corpus_b))
Of course, this particular `corpus_a_subset` will only match the count of sentences in `corpus_b`; it will in fact be much smaller in raw words – around 47k tokens – given the much-shorter average size of corpus-A sentences.
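To see that gap concretely – a quick check, again assuming each sentence is a list of tokens:

    # Compare raw token counts of the sentence-matched subset vs. corpus-B
    print(sum(len(s) for s in corpus_a_subset), "vs", sum(len(s) for s in corpus_b))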
If you were instead aiming for a roughly 927k-token-long subset to match the corpus-B token count, you'd need about (927k / 3 =) 309000 sentences:
    corpus_a_subset = random.sample(corpus_a, 309000)
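Alternatively, if you wanted to match corpus-B's token count exactly, rather than relying on the roughly-3-tokens-per-sentence average, one sketch (shuffling a copy of corpus-A, then taking sentences until the token budget is met) would be:

    import random

    # Total token budget: match corpus-B's raw token count
    target_tokens = sum(len(sentence) for sentence in corpus_b)

    # Shuffle a copy of corpus-A, then take sentences until the budget is reached
    shuffled = list(corpus_a)
    random.shuffle(shuffled)

    corpus_a_subset = []
    token_count = 0
    for sentence in shuffled:
        if token_count >= target_tokens:
            break
        corpus_a_subset.append(sentence)
        token_count += len(sentence)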
Still, while such a `corpus_a_subset` should closely match the raw word count of `corpus_b`, it's likely still a very different corpus in terms of unique tokens, tokens' relative frequencies, and even the total number of training contexts – since contexts in the shorter sentences will far more often be limited by sentence-end than by the full `window` length. (Despite the similarity in bottom-line token count, training times might be noticeably different, especially if your `window` is large.)
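You can estimate that context-count difference directly – here's a sketch that counts skip-gram (center, context) pairs with a fixed window, clipped at sentence ends, ignoring the frequent-word downsampling and random window-shrinking a real training run would also apply (the helper name is just illustrative):

    def count_context_pairs(corpus, window=5):
        """Count (center, context) pairs, with windows clipped at sentence boundaries."""
        pairs = 0
        for sentence in corpus:
            n = len(sentence)
            for i in range(n):
                pairs += min(i, window) + min(n - 1 - i, window)
        return pairs

    print(count_context_pairs(corpus_a_subset), "vs", count_context_pairs(corpus_b))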
If your main interest were simply being able to train on corpus-A subsets as quickly as on a smaller corpus, there are other ways to slim it besides discarding many of its sentences:
- The `sample` parameter increases the rate at which occurrences of highly-frequent words are randomly skipped. With typical Zipfian word-frequencies, common words appear so many times, in all their varied usages, that it's safe to ignore many occurrences as redundant. Further, discarding many of those excessive examples – allowing relatively more attention on rarer words – often improves the overall usefulness of the final word-vectors. Especially in very-large corpuses, picking a more aggressive (smaller) `sample` value can throw out lots of the corpus, speeding training, but still result in better vectors.
- Raising the `min_count` parameter discards ever-more of the less-frequent words. Contrary to any intuition that "more data is always better", this often improves the usefulness of the surviving words' vectors. That's because words with just a few usage examples tend not to get great vectors – those few examples won't show the variety & representativeness that's needed – yet the prevalence of so many such rare-but-insufficiently-demonstrated words still interferes with the training of other words' vectors.
As long as there are still enough examples of the more-frequent and important words, aggressive settings for `sample` and `min_count` against a large corpus may decrease the effective corpus size by 90% or more – and still create high-quality vectors for the remaining words.
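For illustration, assuming you're using gensim's `Word2Vec` (whose parameters these are; names below follow gensim 4.x, and the specific values are placeholders rather than recommendations), such an aggressive setup might look like:

    from gensim.models import Word2Vec

    model = Word2Vec(
        sentences=corpus_a,
        vector_size=100,
        window=5,
        min_count=10,   # discard words appearing fewer than 10 times
        sample=1e-05,   # more-aggressive downsampling of very-frequent words
        workers=4,
        epochs=5,
    )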
But also note: neither of your corpuses is quite as large as is best for `word2vec` training. The algorithm benefits a lot from large, varied corpuses. Your corpus-B, especially, is tiny compared to lots of `word2vec` work – and while you can somewhat 'stretch' a corpus's impact with more training epochs, smaller vectors, or a smaller surviving vocabulary, you may still be below the corpus size where `word2vec` works best. So if at all possible, I'd be looking at ways to grow corpus-B, more so than shrink corpus-A.
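If you do proceed with corpus-B as-is, that 'stretching' might look something like the following sketch (again gensim 4.x names, with values that are purely illustrative):

    from gensim.models import Word2Vec

    small_model = Word2Vec(
        sentences=corpus_b,
        vector_size=50,   # smaller vectors for a small corpus
        window=5,
        min_count=2,
        epochs=20,        # more passes to squeeze more out of limited text
        workers=4,
    )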