
I need some help with a downsampling issue. I have to make a larger corpus (6 654 940 sentences, 19 592 258 tokens) comparable to a smaller one (15 607 sentences, 927 711 tokens), so that I can train 2 comparable word2vec models on them. Each corpus is a list of lists, in which each inner list is a tokenized sentence, e.g. [['the', 'boy', 'eats'], ['the', 'teacher', 'explains'], ...]

I want to downsample the larger one to have the same number of tokens as the smaller one, keeping the original data structure: downsampling sentences until I get the desired number of tokens. I am a complete beginner at programming, and I thought of two possible ways of proceeding, but I am not sure how to implement either of them:

  • downsampling the list of lists
  • downsampling the trained word2vec model (I saw in the forum that there is a "sample" parameter to downsample the most frequent words, but I want to get random sentences)

Can you help me out?

Thank you very much!! :)

chiaras15
  • Why do you need to shrink the larger corpus to a similar size? (Why not make the best model possible from each full corpus?) Your reason(s) for wanting to throw away a lot of data could affect what subsampling approaches would be appropriate. Similarly, will you want the shrunken-first-corpus to be the same size as the second-corpus in terms of count of sentences, or count of raw training words, or count of actual training words (after things like eliminating rare words), or count of final learned vocabulary-size? (Each would have a slightly-different approach, & slightly-different effects.) – gojomo Jan 23 '20 at 18:58
  • I want to train 2 models, one with the full first corpus and the other with the first corpus reduced to the same size as the second corpus (I'll discuss the results in a project for my master's course). In this second case, I'd like the first corpus to have the same number of raw training words as my second corpus, while still maintaining my structure of a list of sentences of tokens. Does that make sense? – chiaras15 Jan 23 '20 at 19:47
  • Thanks, but **why** is that process of making the corpus-sizes match considered important? What **benefit** will it provide over the usually-optimal approach of using as much data as you can get your hands on & fit within your resource/time constraints? (I'll put some ideas for how to do what you're literally asking for in a formal answer, but without really knowing the reasons they may not be appropriate/optimal.) – gojomo Jan 24 '20 at 00:51
  • Thank you very much! I thought it was necessary to make all the corpora I want to compare the same size... I'm a complete beginner – chiaras15 Jan 25 '20 at 00:00

1 Answer


Let's label a few of the things you've mentioned explicitly:

corpus-A 6 654 940 sentences, 19 592 258 tokens (2.9 tokens per sentence)

corpus-B 15 607 sentences, 927 711 tokens (60 tokens per sentence)

I'll observe right away that the tiny average size of corpus-A sentences suggests they might not be the kind of natural-language-like runs-of-words against which word2vec is typically run. And such clipped sentences may not give rise to the kinds of window-sized contexts that are most typical for this kind of training. (The windows will be atypically small, no matter your choice of window. And note further that no training can happen from a sentence with a single token – it's a no-op.)
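A quick check of how common such clipped sentences are is easy to run yourself (just a sketch; corpus_a is assumed to be your list of token-lists):

# count single-token sentences (which contribute no training pairs at all)
single_token_sentences = sum(1 for sentence in corpus_a if len(sentence) < 2)
# and the average sentence length
average_length = sum(len(sentence) for sentence in corpus_a) / len(corpus_a)
print(single_token_sentences, average_length)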

So, any scaling/sampling of corpus-A (with its sentences of around 3 tokens) is not, at the end of the process, going to be that much like corpus-B (with its more typical sentences of dozens to possibly hundreds of tokens). They won't really be alike, except in some singular measurement you choose to target.

If in fact you have enough memory to operate on corpus-A completely in RAM, then choosing a random subset of 15 607 sentences – to match the sentence count of corpus-B – is very simple using standard Python functions:

import random
corpus_a_subset = random.sample(corpus_a, len(corpus_b))

Of course, this particular corpus_a_subset will only match the count of sentences in corpus_b, but will in fact be much smaller in raw words – around 47k tokens – given the much-shorter average size of corpus-A sentences.
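You can verify the actual token count of any subset with a quick sum (a sketch, reusing the corpus_a_subset from the snippet above):

# total tokens that actually landed in the sampled subset
subset_token_count = sum(len(sentence) for sentence in corpus_a_subset)
print(subset_token_count)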

If you were instead aiming for a roughly 927k-token-long subset to match the corpus-B token count, you'd need about (927k / 3 =) 309000 sentences:

corpus_a_subset = random.sample(corpus_a, 309000)
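Alternatively, if you'd rather hit the corpus-B token count exactly instead of relying on the per-sentence average, one sketch (with purely illustrative variable names) is to shuffle a copy of corpus-A and keep adding sentences until the running token total reaches the target – essentially the "downsample sentences until I get the desired number of tokens" approach you described:

import random

# the exact corpus-B token count (~927k) as the target
target_tokens = sum(len(sentence) for sentence in corpus_b)

# shuffle a copy, so the original corpus-A ordering is left untouched
shuffled_a = list(corpus_a)
random.shuffle(shuffled_a)

# accumulate random sentences until the token total reaches the target
corpus_a_subset = []
token_count = 0
for sentence in shuffled_a:
    if token_count >= target_tokens:
        break
    corpus_a_subset.append(sentence)
    token_count += len(sentence)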

Still, while this should make corpus_a_subset closely match the raw word count of corpus_b, it's likely to remain a very different corpus in terms of unique tokens, tokens' relative frequencies, and even the total number of training contexts – as the contexts with the shorter sentences will far more often be limited by sentence-end, than full window-length. (Despite the similarity in bottom-line token-count, the training times might be noticeably different, especially if your window is large.)
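If you want to see that difference yourself, comparing vocabulary sizes is one rough indicator (again just a sketch with illustrative names):

# compare unique-token counts of the subset vs. corpus-B
vocab_a_subset = set(token for sentence in corpus_a_subset for token in sentence)
vocab_b = set(token for sentence in corpus_b for token in sentence)
print(len(vocab_a_subset), len(vocab_b))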

If your main interest were simply being able to train on corpus-A as quickly as on a smaller corpus, there are other ways, besides discarding many of its sentences, to slim it:

  • the sample parameter increases the rate at which occurrences of highly-frequent words are randomly skipped. In typical Zipfian word-frequencies, common words appear so many times, in all their possible varied usages, that it's safe to ignore many of them as redundant. And further, discarding many of those excessive examples, by allowing relatively more attention on rarer words, often improves the overall usefulness of the final word-vectors. Especially in very-large corpuses, picking a more aggressive (smaller) sample value can throw out lots of the corpus, speeding training, but still result in better vectors.

  • raising the min_count parameter discards ever-more of the less-frequent words. As opposed to any intuition that "more data is always better", this often improves the usefulness of the surviving words' vectors. That's because words with just a few usage examples tend not to get great vectors – those few examples won't show the variety & representativeness that's needed – yet the prevalence of so many such rare-but-insufficiently-demonstrated words still interferes with the training of other words.

As long as there are still enough examples of the more-frequent and important words, aggressive settings for sample and min_count, against a large corpus, may decrease the effective size by 90% or more – and still create high-quality vectors for the remaining words.
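For example, here's a sketch of training with more aggressive values than gensim's defaults (sample=0.001, min_count=5) – the exact numbers are just placeholders to experiment with, and the parameter names assume gensim 4.x (older 3.x releases used size and iter instead of vector_size and epochs):

from gensim.models import Word2Vec

# more-aggressive-than-default frequent-word downsampling & rare-word trimming
# (the specific values are placeholders to tune, not recommendations)
model = Word2Vec(
    sentences=corpus_a,
    vector_size=100,  # dimensionality of the word-vectors
    window=5,
    sample=1e-05,     # default 1e-03; smaller skips more occurrences of frequent words
    min_count=10,     # default 5; higher discards more rare words entirely
    workers=4,
    epochs=5,
)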

But also note: neither of your corpuses is quite as large as is best for word2vec training. It benefits a lot from large, varied corpuses. Your corpus-B, especially, is tiny compared to lots of word2vec work – and while you can somewhat 'stretch' a corpus's impact with more training epochs, and using smaller vectors or a smaller surviving vocabulary, you still may be below the corpus size where word2vec works best. So if at all possible, I'd be looking at ways to grow corpus-B, more so than shrink corpus-A.

gojomo