3

I'm using gawk to go through a large textual corpus (about 3-4 GB, a compilation of ebooks) in order to print out every combination of 3 words that appears at least 3 times, so I can produce linguistic statistics. Here is the code:

content of file.awk:

BEGIN { RS = "[^[:alnum:]]+" }                    # every run of non-alphanumerics ends a record, so each record is one word

{ w1 = w2; w2 = w3; w3 = $0 }                     # slide a 3-word window over the input

NR > 2 { count[tolower(w1 " " w2 " " w3)]++ }     # count each lower-cased trigram

END {
    for (phrase in count) {
        if (count[phrase] >= 3) {
            print phrase, count[phrase]
        }
    }
}

command: gawk -f file.awk mytxtfile > output

It works fine with small files (a few hundred MB), but I can't get it to work with files bigger than 1 GB: gawk eats all my RAM (8 GB) in less than a minute, then starts eating my swap, and the whole system eventually freezes.

Do you know how I could optimize the code, even if it eventually takes much longer?

Thank you very much

bobylapointe
  • While an all-gawk solution would be elegant, maybe Unix pipelines can help you, i.e. `awk '{print all 3 wrd sets}' | sort | uniq -c | awk '$1>2{print}'` or similar; a fleshed-out sketch follows these comments. Good luck. – shellter Jun 25 '12 at 23:22
  • Another approach would be to store the keys in a database, so you don't need to keep them in memory. This goes beyond what you can conveniently do with `awk`, though; but perhaps moving to e.g. Python would not be an insurmountable complication. – tripleee Jun 26 '12 at 05:45
  • That's not a bad idea, I'm going to give it a shot, thanks shellter – bobylapointe Jun 26 '12 at 05:49
  • bobylapointe, request you to please post sample Input_file and expected output. I am pretty sure we could help more by seeing that. – RavinderSingh13 Dec 23 '16 at 14:07
  • Maybe try the lighter-weight `mawk` instead? I believe the only change your code would need is `RS="[^a-zA-Z0-9]+"`. This will likely only work if you're just over the memory limit, but it's at least really easy to check. (Also, I really doubt `sort` will work given this size.) – Adam Katz Sep 18 '18 at 20:31
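
A fleshed-out version of shellter's pipeline idea might look like this (an untested sketch; the counting moves from an in-memory awk array to a disk-based sort, so memory use stays bounded even for a multi-GB corpus):

gawk 'BEGIN { RS = "[^[:alnum:]]+" }
      { w1 = w2; w2 = w3; w3 = tolower($0) }
      NR > 2 { print w1, w2, w3 }' mytxtfile |
LC_ALL=C sort |
uniq -c |
awk '$1 >= 3 { print $2, $3, $4, $1 }' > output

LC_ALL=C is only there to speed up the sort; drop it if you want locale-aware ordering.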

3 Answers

0

As long as you need to retain information until the very end, your memory requirement is O(number of distinct ordered 3-word combinations): with a vocabulary of about 200K words, that is up to 8,000,000,000,000,000 combinations...

Even if your books' combined vocabulary is much smaller -- say, only 50K words -- that's still 50K^3, or 1.25*10^14, possible combinations. Then, even if your awk implementation used only 16 bytes per entry (impossibly little), that's still 2,000,000,000,000,000 bytes -- or 2000 TB.

That's a worst-case scenario, but it shows what orders of magnitude you are playing with.

Maybe you don't need the word combinations to be ordered? In that case you can reduce the number of array entries 6-fold by sorting the words first. But I doubt that would help you either...
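
A minimal sketch of that idea, reusing the variable names from the question (note that it counts unordered 3-word sets, which is a different statistic from ordered trigrams):

BEGIN { RS = "[^[:alnum:]]+" }
{ w1 = w2; w2 = w3; w3 = tolower($0) }
NR > 2 {
    # sort the three words so that e.g. "b a c" and "c a b" share the key "a b c"
    a = w1; b = w2; c = w3
    if (a > b) { t = a; a = b; b = t }
    if (b > c) { t = b; b = c; c = t }
    if (a > b) { t = a; a = b; b = t }
    count[a " " b " " c]++
}
END { for (k in count) if (count[k] >= 3) print k, count[k] }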

Mikhail T.
0

Your solution isn't very efficient in terms of strings: it allocates one for every unique trigram, and in a large corpus there are a lot of them. Instead you could set up a table with three indices and do count[w1][w2][w3]++. That requires a bit more work at the end, but now there is only one string per unique token.
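
With gawk 4's arrays of arrays, that could look roughly like this (a sketch, reusing the record-splitting and variable names from the question; it needs gawk, not plain awk or mawk):

BEGIN { RS = "[^[:alnum:]]+" }
{ w1 = w2; w2 = w3; w3 = tolower($0) }
NR > 2 { count[w1][w2][w3]++ }
END {
    for (a in count)
        for (b in count[a])
            for (c in count[a][b])
                if (count[a][b][c] >= 3)
                    print a, b, c, count[a][b][c]
}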

If that's not good enough, you can always run your code on smaller groups of text, sort the output, and then merge them.
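
One rough way to do that from the shell (an untested sketch: file_nofilter.awk is a hypothetical variant of file.awk that prints every trigram with its count, i.e. without the >= 3 filter, and the few trigrams that straddle chunk boundaries are lost):

split -n l/20 mytxtfile chunk_            # 20 line-aligned pieces (GNU split)
for f in chunk_*; do
    gawk -f file_nofilter.awk "$f" | sort > "$f.counts"
done
sort -m chunk_*.counts |
gawk '{ key = $1 " " $2 " " $3
        if (key != prev) { if (sum >= 3) print prev, sum; prev = key; sum = 0 }
        sum += $4 }
      END { if (sum >= 3) print prev, sum }' > output

Because the merged stream is sorted, the final gawk only has to remember one key at a time, so its memory use stays flat.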

BTW, I guess your code is just a rough first cut -- or are you really forgoing things like end-of-sentence markers?

Shit, I'm answering a 6-year-old question.

0

You mean something along these lines?

pvE0 < "${m3l}" \
  | mawk '/^[\n-~]+$/*(NF=NF)' FS='\v' \
  | mawk2 'BEGIN { FS = RS = "^$"
                   OFS = ORS = ""
     } END {
             print (_=$(_<_))(_)(_)(_)(_)(_)(_) }' \
  | pvE9 \
  | mawk2 'BEGIN {
             FS = "[^a-zA-Z0-9]+"
             RS = "^$"
     } END {
             for (_ = (__ = NF = NF)~""; _ < NF; _++) {
                 if (!(____[___ = ($_)(OFS)$(_+1)(OFS)$(_+2)]++)) {
                     print ___, gsub(___, "&") } } }'

I don t 82348
don t drink 63
t drink coffee 28
drink coffee I 35
coffee I take 21
I take tea 28
take tea my 28
tea my dear 28
my dear I 140
dear I like 28
I like my 616
like my toast 28
my toast done 28
toast done on 28
done on one 7
on one side 14
one side And 7
side And you 140
And you can 1589

The downside of such a filter is that apostrophes get chopped off, so don't becomes don t.

RARE Kpop Manifesto