
I have 3 million abstracts and I would like to extract 4-grams from them. I want to build a language model, so I need the frequencies of these 4-grams.

My problem is that I can't hold all these 4-grams in memory. How can I implement a system that can estimate the frequencies of all these 4-grams?

  • Have you looked at HDF5 or PyTables? As far as I know they connect well to numpy and are supposedly fast. – Magellan88 Sep 21 '16 at 10:25
  • Thank you for your feedback. I will check them out. – Dimitris Dimitriadis Sep 21 '16 at 10:37
  • Most 4-grams appear just once, so perhaps you can get the needed information by finding those that appear more than once. A key observation is that a 4-gram appears more than once if it extends a trigram which appears more than once, and such a trigram appears more than once if it extends a bigram which appears more than once. You can do things in stages. First find such bigrams (possibly feasible) and then find the trigrams and then finally the 4-grams. My answer for this question shows this idea for trigrams: http://stackoverflow.com/a/36935796/4996248 – John Coleman Sep 21 '16 at 10:54
  • Thank you for your feedback. That is useful information and I'll take it into account. – Dimitris Dimitriadis Sep 21 '16 at 11:04
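The staged pruning idea from John Coleman's comment can be sketched as below. The sample documents, the `ngrams` helper, and the `frequent_ngrams` function are illustrative names I've made up for this sketch, not code from the linked answer; the key point is that an n-gram filtered by whether its (n-1)-gram prefix repeats can never lose a repeated n-gram, since a prefix occurs at least as often as any of its extensions:

```python
from collections import Counter
from itertools import islice

def ngrams(tokens, n):
    """Yield all n-grams of a token list as tuples."""
    return zip(*(islice(tokens, i, None) for i in range(n)))

def frequent_ngrams(docs, n, survivors=None):
    """Count n-grams whose (n-1)-gram prefix is in `survivors`.

    `survivors` is a set of (n-1)-gram tuples from the previous stage,
    or None to count everything (used for the first stage).
    """
    counts = Counter()
    for tokens in docs:
        for gram in ngrams(tokens, n):
            if survivors is None or gram[:-1] in survivors:
                counts[gram] += 1
    return counts

# Toy corpus standing in for tokenized abstracts.
docs = [["the", "cat", "sat", "on", "the", "mat"],
        ["the", "cat", "sat", "on", "a", "hat"]]

# Stage 1: bigrams that repeat; Stage 2: trigrams extending them;
# Stage 3: 4-grams extending the repeated trigrams.
bigrams = frequent_ngrams(docs, 2)
repeated_bigrams = {g for g, c in bigrams.items() if c > 1}
trigrams = frequent_ngrams(docs, 3, repeated_bigrams)
repeated_trigrams = {g for g, c in trigrams.items() if c > 1}
fourgrams = frequent_ngrams(docs, 4, repeated_trigrams)
```

In a real run each stage would stream over the abstracts from disk, so only the (much smaller) survivor sets need to stay in memory.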

1 Answer


Sounds like you need to store the intermediate frequency counts on disk rather than in memory. Luckily most databases can do this, and Python can talk to most databases.
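One minimal way to do this with no external dependencies is Python's built-in `sqlite3` module; a sketch (the table name and helper functions are my own choices, and the `ON CONFLICT` upsert needs SQLite 3.24 or newer):

```python
import sqlite3

def open_store(path):
    """Open (or create) an on-disk SQLite store for 4-gram counts."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS grams (gram TEXT PRIMARY KEY, freq INTEGER)")
    return conn

def add_counts(conn, grams):
    """Upsert one batch of 4-grams; SQLite keeps the running totals on disk."""
    conn.executemany(
        "INSERT INTO grams (gram, freq) VALUES (?, 1) "
        "ON CONFLICT(gram) DO UPDATE SET freq = freq + 1",
        ((" ".join(g),) for g in grams),
    )
    conn.commit()

def freq(conn, gram):
    """Look up the stored frequency of one 4-gram (0 if unseen)."""
    row = conn.execute(
        "SELECT freq FROM grams WHERE gram = ?", (" ".join(gram),)).fetchone()
    return row[0] if row else 0

# Usage: feed the abstracts through in batches so memory stays bounded.
conn = open_store(":memory:")  # use a real file path, e.g. "ngrams.db"
add_counts(conn, [("the", "cat", "sat", "on"), ("cat", "sat", "on", "the")])
add_counts(conn, [("the", "cat", "sat", "on")])
```

Batching the inserts (one `executemany` and one `commit` per chunk of abstracts) matters a lot for throughput with SQLite.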

Hardbyte
  • I already had this in mind but I don't know how efficient it is. I thought of creating, for every 100,000 articles, a file sorted by the first word, and then implementing merge sort to combine them into a single file with all of the 4-grams sorted. – Dimitris Dimitriadis Sep 21 '16 at 10:44
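The external sort-merge approach described in that comment can be sketched with the standard library alone, assuming fully sorted chunk files rather than sorting only by the first word; `heapq.merge` streams the sorted chunks and `itertools.groupby` turns runs of identical 4-grams into counts, so nothing large ever sits in memory (the helper names here are illustrative):

```python
import heapq
import itertools
import os
import tempfile

def write_sorted_chunk(grams, directory):
    """Sort one in-memory chunk of 4-gram tuples and spill it to a file."""
    fd, path = tempfile.mkstemp(dir=directory, text=True)
    with os.fdopen(fd, "w") as f:
        for g in sorted(grams):
            f.write(" ".join(g) + "\n")
    return path

def merged_counts(paths):
    """Stream-merge the sorted chunk files, yielding (gram, frequency)."""
    files = [open(p) for p in paths]
    try:
        merged = heapq.merge(*files)  # lazy k-way merge of sorted streams
        stripped = (line.rstrip("\n") for line in merged)
        for gram, run in itertools.groupby(stripped):
            yield gram, sum(1 for _ in run)
    finally:
        for f in files:
            f.close()

# Usage with two toy chunks standing in for per-100,000-article files.
tmp = tempfile.mkdtemp()
p1 = write_sorted_chunk([("a", "b", "c", "d"), ("the", "cat", "sat", "on")], tmp)
p2 = write_sorted_chunk([("the", "cat", "sat", "on")], tmp)
counts = dict(merged_counts([p1, p2]))
```

This is essentially the classic external sort; the final pass can write the counts straight to the language-model file instead of materializing them in a dict as the toy usage does.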