
What options are there for making word counts on very large files?

I believe the whole file is on 1 line, which may be part of the problem as pointed out in one of the answers below.

In this case, I have a 1.7 GB XML file and I am trying to count some things inside it quickly.

I found the post "Count number of occurrences of a pattern in a file (even on same line)", and that approach works for me up to a certain size.

Up to 300 MB or so (about 40 000 occurrences) this worked fine:

cat file.xml | grep -o xmltag | wc -l    

but above that size, I get "memory exhausted".

user985366
    did you try this `grep -o 'xmltag' file.xml | wc -l ` – Avinash Raj Jul 10 '14 at 07:31
  • Re Raj's note: you should not use `cat` with a program that can read the data itself. It slows the program down and makes the pipeline more complicated. – Jotne Jul 10 '14 at 07:36
  • Yeah, somehow you need to split the file into chunks. Note that this splitting will likely be arbitrary, though, so your word count may be increased by (number of chunks - 1), due to one split word per chunk. – Hot Licks Jul 16 '14 at 15:57

3 Answers


http://lists.gnu.org/archive/html/parallel/2014-07/msg00009.html

EXAMPLE: Grepping n lines for m regular expressions.

The simplest solution to grep a big file for a lot of regexps is:

grep -f regexps.txt bigfile

Or if the regexps are fixed strings:

grep -F -f regexps.txt bigfile
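
For the single fixed string in the question, the pattern file would contain just one line. A minimal sketch, assuming a hypothetical patterns.txt holding the literal xmltag, with the wc -l step from the question added back to get the count:

echo xmltag > patterns.txt
grep -o -F -f patterns.txt file.xml | wc -l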

There are 2 limiting factors: CPU and disk I/O. CPU is easy to measure: If the grep takes >90% CPU (e.g. when running top), then the CPU is a limiting factor, and parallelization will speed this up. If not, then disk I/O is the limiting factor, and depending on the disk system it may be faster or slower to parallelize. The only way to know for certain is to measure.
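
One rough way to measure this (a sketch, not from the original text): run the grep under time and send the output to /dev/null; if user plus sys time is close to the real (elapsed) time, the job is CPU-bound, while a much larger real time means it is mostly waiting on disk I/O.

time grep -F -f regexps.txt bigfile > /dev/null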

If the CPU is the limiting factor parallelization should be done on the regexps:

cat regexp.txt | parallel --pipe -L1000 --round-robin grep -f - bigfile

This will start one grep per CPU and read bigfile one time per CPU, but as that is done in parallel, all reads except the first will be cached in RAM. Depending on the size of regexp.txt it may be faster to use --block 10m instead of -L1000. If regexp.txt is too big to fit in RAM, remove --round-robin and adjust -L1000. This will cause bigfile to be read more times.

Some storage systems perform better when reading multiple chunks in parallel. This is true for some RAID systems and for some network file systems. To parallelize the reading of bigfile:

parallel --pipepart --block 100M -a bigfile grep -f regexp.txt

This will split bigfile into 100MB chunks and run grep on each of these chunks. To parallelize both reading of bigfile and regexp.txt combine the two using --fifo:

parallel --pipepart --block 100M -a bigfile --fifo cat regexp.txt \| parallel --pipe -L1000 --round-robin grep -f - {}
Ole Tange

How many newlines are in your file.xml? If one of your lines is extremely long, that might explain why grep fails with "grep: memory exhausted".
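
A quick way to check (a small sketch, not part of the original answer) is to count the newlines directly; a result of 0 or close to it confirms the file is essentially one long line:

wc -l file.xml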

A solution to that is to introduce \n at places where it does not matter, say before every </:

cat big.xml | perl -e 'while(sysread(STDIN,$buf, 32768)){ $buf=~s:</:\n</:g; syswrite(STDOUT,$buf); }'
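
To get the count asked for in the question, the output of that splitter can be fed straight into the original pipeline; a minimal sketch combining the two:

cat big.xml | perl -e 'while(sysread(STDIN,$buf, 32768)){ $buf=~s:</:\n</:g; syswrite(STDOUT,$buf); }' | grep -o xmltag | wc -l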

GNU Parallel can chop the big file into smaller chunks. Again you will need to find good chopping places that are not in the middle of a match. For XML a good place will often be between > and <:

parallel -a big.xml --pipepart --recend '>' --recstart '<' --block 10M grep -o xmltag

Even better is an end tag that represents the end of a record:

parallel -a big.xml --pipepart --recend '</endrecord>' --block 10M grep -o xmltag

Note that --pipepart is a relatively new option, so you need version 20140622 or later.
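
To turn either of these commands into the count from the question, pipe the combined output through wc -l; a sketch, where </endrecord> stands in for whatever tag actually closes a record in the file:

parallel -a big.xml --pipepart --recend '</endrecord>' --block 10M grep -o xmltag | wc -l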

Ole Tange

Try using GNU Parallel like this... it will split file.xml into chunks of about 1 MB (split at the nearest newline) and pass each chunk to one CPU core to run grep, so not only should it work, it should also work faster:

parallel --pipe grep -o xmltag < file.xml | wc -l
Mark Setchell