I'm using gawk to go through a large text corpus (about 3-4 GB, a compilation of ebooks) in order to print out every sequence of 3 consecutive words that appears at least 3 times, so I can produce linguistic statistics. Here is the code:
content of file.awk:
BEGIN { RS = "[^[:alnum:]]+" }   # any run of non-alphanumeric characters ends a record, so each record is a single word
{ w1 = w2; w2 = w3; w3 = $0 }    # sliding window over the last three words
NR > 2 { count[tolower(w1 " " w2 " " w3)]++ }   # count each lowercased trigram
END {
    for (phrase in count) {
        if (count[phrase] >= 3) {
            print phrase, count[phrase]
        }
    }
}
command: gawk -f file.awk mytxtfile > output
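For what it's worth, the regex RS seems to behave as I expect; a quick sanity check like the following (made-up test input) prints one word per record:

command: printf 'Hello, world! Foo--bar' | gawk 'BEGIN { RS = "[^[:alnum:]]+" } { print NR ": " $0 }'

which gives "1: Hello", "2: world", "3: Foo", "4: bar", so the record splitting itself doesn't seem to be the issue.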
It works fine with small files (a few hundred MB), but I can't get it to work with files bigger than about 1 GB: gawk eats all my RAM (8 GB) in less than a minute, then starts eating into swap, and the whole system eventually freezes.
Do you know how I could optimize the code, even if it eventually takes much longer?
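The only idea I've had so far (I'm not sure whether it's sensible) is to stop holding the whole count array in memory and let an external sort do the counting on disk instead, with a variant like this (file2.awk is just a name I made up):

content of file2.awk:
BEGIN { RS = "[^[:alnum:]]+" }
{ w1 = w2; w2 = w3; w3 = tolower($0) }
NR > 2 { print w1, w2, w3 }   # emit each trigram immediately instead of counting in memory

command: gawk -f file2.awk mytxtfile | sort | uniq -c | awk '$1 >= 3 { print $2, $3, $4, $1 }' > output

Here sort would spill to temporary files rather than RAM, and the final awk keeps the same output format (the trigram followed by its count), but I don't know if that's the right direction.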
Thank you very much