
I have 3 million abstracts and I would like to extract 4-grams from them. I want to build a language model, so I need the frequencies of these 4-grams.

My problem is that I can't hold all these 4-grams in memory. How can I implement a system that can estimate the frequencies of all these 4-grams?

  • Have you looked at HDF5 or PyTables? As far as I know they connect well to numpy and are supposedly fast. – Magellan88 Sep 21 '16 at 10:25
  • Thank you for your feedback. I will check them out. – Dimitris Dimitriadis Sep 21 '16 at 10:37
  • Most 4-grams appear just once, so perhaps you can get the needed information by finding those that appear more than once. A key observation is that a 4-gram appears more than once if it extends a trigram which appears more than once, and such a trigram appears more than once if it extends a bigram which appears more than once. You can do things in stages. First find such bigrams (possibly feasible) and then find the trigrams and then finally the 4-grams. My answer for this question shows this idea for trigrams: http://stackoverflow.com/a/36935796/4996248 – John Coleman Sep 21 '16 at 10:54
  • Thank you for your feedback. That is useful information and I'll take it into account. – Dimitris Dimitriadis Sep 21 '16 at 11:04
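The staged pruning idea from John Coleman's comment can be sketched as below. The sample documents, the `ngrams` helper, and the `frequent_ngrams` function are illustrative names I've made up for this sketch, not code from the linked answer; the key point is that an n-gram filtered by whether its (n-1)-gram prefix repeats can never lose a repeated n-gram, since a prefix occurs at least as often as any of its extensions:

```python
from collections import Counter
from itertools import islice

def ngrams(tokens, n):
    """Yield all n-grams of a token list as tuples."""
    return zip(*(islice(tokens, i, None) for i in range(n)))

def frequent_ngrams(docs, n, survivors=None):
    """Count n-grams whose (n-1)-gram prefix is in `survivors`.

    `survivors` is a set of (n-1)-gram tuples from the previous stage,
    or None to count everything (used for the first stage).
    """
    counts = Counter()
    for tokens in docs:
        for gram in ngrams(tokens, n):
            if survivors is None or gram[:-1] in survivors:
                counts[gram] += 1
    return counts

# Toy corpus standing in for tokenized abstracts.
docs = [["the", "cat", "sat", "on", "the", "mat"],
        ["the", "cat", "sat", "on", "a", "hat"]]

# Stage 1: bigrams that repeat; Stage 2: trigrams extending them;
# Stage 3: 4-grams extending the repeated trigrams.
bigrams = frequent_ngrams(docs, 2)
repeated_bigrams = {g for g, c in bigrams.items() if c > 1}
trigrams = frequent_ngrams(docs, 3, repeated_bigrams)
repeated_trigrams = {g for g, c in trigrams.items() if c > 1}
fourgrams = frequent_ngrams(docs, 4, repeated_trigrams)
```

In a real run each stage would stream over the abstracts from disk, so only the (much smaller) survivor sets need to stay in memory.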

1 Answer


Sounds like you need to store the intermediate frequency counts on disk rather than in memory. Luckily most databases can do this, and Python can talk to most databases.
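One minimal way to do this with no external dependencies is Python's built-in `sqlite3` module; a sketch (the table name and helper functions are my own choices, and the `ON CONFLICT` upsert needs SQLite 3.24 or newer):

```python
import sqlite3

def open_store(path):
    """Open (or create) an on-disk SQLite store for 4-gram counts."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS grams (gram TEXT PRIMARY KEY, freq INTEGER)")
    return conn

def add_counts(conn, grams):
    """Upsert one batch of 4-grams; SQLite keeps the running totals on disk."""
    conn.executemany(
        "INSERT INTO grams (gram, freq) VALUES (?, 1) "
        "ON CONFLICT(gram) DO UPDATE SET freq = freq + 1",
        ((" ".join(g),) for g in grams),
    )
    conn.commit()

def freq(conn, gram):
    """Look up the stored frequency of one 4-gram (0 if unseen)."""
    row = conn.execute(
        "SELECT freq FROM grams WHERE gram = ?", (" ".join(gram),)).fetchone()
    return row[0] if row else 0

# Usage: feed the abstracts through in batches so memory stays bounded.
conn = open_store(":memory:")  # use a real file path, e.g. "ngrams.db"
add_counts(conn, [("the", "cat", "sat", "on"), ("cat", "sat", "on", "the")])
add_counts(conn, [("the", "cat", "sat", "on")])
```

Batching the inserts (one `executemany` and one `commit` per chunk of abstracts) matters a lot for throughput with SQLite.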

Hardbyte
  • I already had this in mind but I don't know how efficient it is. I thought of creating, for every 100,000 articles, a file sorted by the first word, and then implementing merge sort to combine them into a single file with all of the 4-grams sorted. – Dimitris Dimitriadis Sep 21 '16 at 10:44
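The external sort-merge approach described in that comment can be sketched with the standard library alone, assuming fully sorted chunk files rather than sorting only by the first word; `heapq.merge` streams the sorted chunks and `itertools.groupby` turns runs of identical 4-grams into counts, so nothing large ever sits in memory (the helper names here are illustrative):

```python
import heapq
import itertools
import os
import tempfile

def write_sorted_chunk(grams, directory):
    """Sort one in-memory chunk of 4-gram tuples and spill it to a file."""
    fd, path = tempfile.mkstemp(dir=directory, text=True)
    with os.fdopen(fd, "w") as f:
        for g in sorted(grams):
            f.write(" ".join(g) + "\n")
    return path

def merged_counts(paths):
    """Stream-merge the sorted chunk files, yielding (gram, frequency)."""
    files = [open(p) for p in paths]
    try:
        merged = heapq.merge(*files)  # lazy k-way merge of sorted streams
        stripped = (line.rstrip("\n") for line in merged)
        for gram, run in itertools.groupby(stripped):
            yield gram, sum(1 for _ in run)
    finally:
        for f in files:
            f.close()

# Usage with two toy chunks standing in for per-100,000-article files.
tmp = tempfile.mkdtemp()
p1 = write_sorted_chunk([("a", "b", "c", "d"), ("the", "cat", "sat", "on")], tmp)
p2 = write_sorted_chunk([("the", "cat", "sat", "on")], tmp)
counts = dict(merged_counts([p1, p2]))
```

This is essentially the classic external sort; the final pass can write the counts straight to the language-model file instead of materializing them in a dict as the toy usage does.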