
I am training a model with gensim. My corpus is made up of many short sentences, and each sentence has a frequency which indicates how many times it occurs in the total corpus. I implemented it as follows; as you can see, I just repeat each sentence freq times. If the data is small this works, but when the data grows the frequencies can become very large, which costs too much memory and my machine cannot afford it.

So: 1. Can I just count the frequency for each record instead of repeating it freq times? 2. Are there any other ways to save memory?

    import datetime

    import gensim
    from gensim.models.doc2vec import TaggedDocument

    class AddressSentences(object):
        def __init__(self, path):
            self._path = path

        def __iter__(self):
            with open(self._path) as fi:
                headers = next(fi).strip().split(",")
                i_address, i_freq = headers.index("address"), headers.index("freq")
                index = 0
                for line in fi:
                    cols = line.strip().split(",")
                    freq = cols[i_freq]
                    address = cols[i_address].split()
                    # Here I do the repeat: yield the same document freq times
                    for i in range(int(freq)):
                        yield TaggedDocument(address, [index])
                    index += 1

    print("START %s" % datetime.datetime.now())
    train_corpus = list(AddressSentences("/data/corpus.csv"))
    model = gensim.models.doc2vec.Doc2Vec(size=50, min_count=2, iter=55)
    model.build_vocab(train_corpus)
    model.train(train_corpus, total_examples=model.corpus_count, epochs=model.iter)
    print("END %s" % datetime.datetime.now())

The corpus is something like this:

    address,freq
    Cecilia Chapman 711-2880 Nulla St.,1000
    The Business Centre,1000
    61 Wellfield Road,500
    Celeste Slater 606-3727 Ullamcorper. Street,600
    Theodore Lowe Azusa New York 39531,700
    Kyla Olsen Ap #651-8679 Sodales Av.,300
roger
  • and what is your question exactly? – Matti Lyra Jun 12 '17 at 10:30
  • @MattiLyra sorry, I am just editing my question – roger Jun 12 '17 at 10:33
  • So just to double check: you want `N` `TaggedDocument`s, where `N` is equal to the last number on the line? What are you doing with the `TaggedDocument`s? The yield only creates a single one at a time, so you must be caching them somewhere else. Do they need to be separate objects? – Matti Lyra Jun 12 '17 at 10:49
  • @MattiLyra but `model.build_vocab` and `model.train` only accept a `list`, so I repeat `N` times in the `list` – roger Jun 12 '17 at 10:52
  • The items in the list can be references to the same `object`; they don't necessarily need to be separate object instances, which means your memory consumption is equal to 1 object instead of 1000 for the first row (see the sketch just after these comments). But whether you can do that depends on what the downstream code does with those objects. – Matti Lyra Jun 12 '17 at 11:03
  • Also, why do you need to repeat the same address `1000` times? – Matti Lyra Jun 12 '17 at 11:06
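
A minimal sketch of that shared-object idea from the comments, assuming the same CSV layout as in the question (the helper name build_repeated_list is my own, and whether sharing is safe depends on what downstream code does with the objects):

    from gensim.models.doc2vec import TaggedDocument

    def build_repeated_list(path):
        """Build a training list where each repeated slot is a reference to the
        same TaggedDocument object, so memory grows with the number of list
        slots (pointers), not with freq full copies of each document."""
        docs = []
        with open(path) as fi:
            headers = next(fi).strip().split(",")
            i_address, i_freq = headers.index("address"), headers.index("freq")
            for index, line in enumerate(fi):
                cols = line.strip().split(",")
                doc = TaggedDocument(cols[i_address].split(), [index])  # one object per row
                docs.extend([doc] * int(cols[i_freq]))                  # freq references to it
        return docs

The list still holds freq entries per address, but each entry is only a pointer to one shared TaggedDocument, which is far cheaper than freq separate copies.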

1 Answer


Two options for your exact question:

(1)

You don't need to reify your corpus iterator into a fully in-memory list with your line:

    train_corpus = list(AddressSentences("/data/corpus.csv"))

The gensim Doc2Vec model can use your iterable object directly as its corpus, since it implements __iter__() (and thus can be iterated over multiple times). So you can just do:

    train_corpus = AddressSentences("/data/corpus.csv")

Then each line will be read, and each repeated TaggedDocument re-yielded, on every pass, without requiring the full set in memory.
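
For example, a minimal usage sketch with the iterable from the question (keeping the older gensim parameter names size and iter that the question's code uses):

    train_corpus = AddressSentences("/data/corpus.csv")    # restartable iterable, not a list

    model = gensim.models.doc2vec.Doc2Vec(size=50, min_count=2, iter=55)
    model.build_vocab(train_corpus)                        # pass 1: vocabulary discovery
    model.train(train_corpus,                              # further passes: training, streamed each time
                total_examples=model.corpus_count,
                epochs=model.iter)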

(2)

Alternatively, in such cases you may sometimes just want to write a separate routine that takes your original file and, rather than directly yielding TaggedDocuments, does the repetition to create a tangible file on disk which includes the repetitions. Then, use a simpler iterable reader to stream that (already-repeated) dataset into your model.
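
A minimal sketch of that pre-expansion, assuming the question's CSV layout (the file name corpus_repeated.txt and the helper names are my own); each output line carries the original row's tag, so the repeated copies keep the same doc-tag as in the question's code:

    from gensim.models.doc2vec import TaggedDocument

    def expand_corpus(csv_path, out_path):
        """Write one 'tag<TAB>address' line per repetition to disk."""
        with open(csv_path) as fi, open(out_path, "w") as fo:
            headers = next(fi).strip().split(",")
            i_address, i_freq = headers.index("address"), headers.index("freq")
            for index, line in enumerate(fi):
                cols = line.strip().split(",")
                for _ in range(int(cols[i_freq])):
                    fo.write("%d\t%s\n" % (index, cols[i_address]))

    class RepeatedSentences(object):
        """Simple reader that streams the already-repeated file."""
        def __init__(self, path):
            self._path = path

        def __iter__(self):
            with open(self._path) as fi:
                for line in fi:
                    tag, address = line.rstrip("\n").split("\t", 1)
                    yield TaggedDocument(address.split(), [int(tag)])

    expand_corpus("/data/corpus.csv", "/data/corpus_repeated.txt")
    train_corpus = RepeatedSentences("/data/corpus_repeated.txt")

Because each line keeps its tag, a later line-shuffle of the expanded file (as suggested further down) still leaves every copy tied to its original document.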

A negative of this approach, in this particular case, is that it would increase the amount of (likely relatively laggy) disk-IO. However, if the special processing your iterator is doing is more costly – such as regex-based tokenization – this sort of process-and-rewrite can help avoid duplicate work by the model later. (The model needs to scan your corpus once for vocabulary-discovery, then again iter times for training – so any time-consuming work in your iterator will be done redundantly, and may be the bottleneck that keeps other training threads idle waiting for data.)

But after those two options, some Doc2Vec-specific warnings:

Repeating documents like this may not benefit the Doc2Vec model, compared to simply iterating over the full diverse set. It's the tug-of-war interplay of contrasting examples which causes the word-vectors/doc-vectors in Word2Vec/Doc2Vec models to find useful relative arrangements.

Repeating exact documents/word-contexts is a plausible way to "overweight" those examples, but even if that's really what you want, and would help your end-goals, it'd be better to shuffle those repeats through the whole set.

Repeating one example consecutively is like applying the word-cooccurrences of that example like a jackhammer on the internal neural-network, without any chance for interleaved alternate examples to find a mutually-predictive weight arrangement. The iterative gradient-descent optimization through all diverse examples ideally works more like gradual water-driven erosion & re-deposition of values.

That suggests another possible reason to take the second approach, above: after writing the file-with-repeats, you could use an external line-shuffling tool (like sort -R or shuf on Linux) to shuffle the file. Then, the 1000 repeated lines of some examples would be evenly spread among all the other (repeated) examples, a friendlier arrangement for dense-vector learning.
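
If you would rather stay in Python than shell out to sort -R or shuf, a minimal in-memory shuffle looks like this (it loads every line at once, so it only suits expanded files that still fit in RAM; the file names follow the sketch above):

    import random

    with open("/data/corpus_repeated.txt") as fi:
        lines = fi.readlines()                 # all lines in memory at once
    random.shuffle(lines)                      # spread repeats across the whole set
    with open("/data/corpus_shuffled.txt", "w") as fo:
        fo.writelines(lines)

For expanded files too large for RAM, the external tools mentioned above (sort -R in particular, which sorts using temporary files on disk) are the safer choice.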

In any case, I would try leaving out repetition entirely, or shuffling the repetitions, and evaluate which steps really help with whatever the true end-goal is.

gojomo