
Problem Statement:

I need to search for a particular string pattern in around 10,000 files and find the records in those files that contain it. I can use grep here, but it is taking a lot of time.

Below is the command I am using to search for a particular string pattern after unzipping the dat.gz files:

gzcat /data/newfolder/real-time-newdata/*_20120809_0_*.gz | grep 'b295ed051380a47a2f65fb75ff0d7aa7^]3^]-1'

If I simply count how many files there are after unzipping the above dat.gz files:

gzcat /data/newfolder/real-time-newdata/*_20120809_0_*.gz | wc -l

I get around 10,000 files. I need to search all of these files for the above string pattern and find the records that contain it. My command above works fine, but it is very, very slow.

What is the best approach here? Should I take 100 files at a time and search for the pattern in those 100 files in parallel?

Note:

I am running SunOS

bash-3.00$ uname -a
SunOS lvsaishdc3in0001 5.10 Generic_142901-02 i86pc i386 i86pc
arsenal
  • Try running the same command against an unzipped file... The bottleneck may well be decompressing the file. – user229044 Aug 15 '12 at 23:51

4 Answers


Do NOT run this in parallel! That's going to bounce the disk head all over the place, and it will be much slower.

Since you are reading an archive file, there's one way to get a substantial performance boost: don't write the results of the decompression out to disk. The ideal answer would be to decompress to a stream in memory; if that's not viable, then decompress to a ramdisk.

In any case you do want some parallelism here: one thread should be obtaining the data and then handing it off to another that does the search. That way you will either be waiting on the disk or on the core doing the decompressing, and you won't waste any of that time doing the search.

(Note that in the case of the ramdisk you will want to aggressively read the files written to it and then delete them so the ramdisk doesn't fill up.)
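
A minimal sketch of the ramdisk variant, assuming /tmp is the usual swap-backed tmpfs on Solaris 10 and reusing the question's paths and pattern (the temporary file name /tmp/search_scratch.dat is made up). Each archive is decompressed into tmpfs, searched, and immediately removed so the ramdisk doesn't fill up:

for f in /data/newfolder/real-time-newdata/*_20120809_0_*.gz; do
    gzcat "$f" > /tmp/search_scratch.dat      # decompress into tmpfs, not onto the real disk
    grep 'b295ed051380a47a2f65fb75ff0d7aa7^]3^]-1' /tmp/search_scratch.dat
    rm -f /tmp/search_scratch.dat             # free the tmpfs space before the next archive
done

Note that the gzcat | grep pipeline in the question already gives the producer/consumer split described above, since the decompressing and the searching run as separate processes.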

Loren Pechtel
  • Makes sense to me, but I cannot unzip the files and then look for the string pattern, as all those files together are 40 GB and we have space issues as well. That is the reason I was looking for some other way of doing it. – arsenal Aug 16 '12 at 00:44
  • @loren I agree in principle, but I don't think it should be assumed that the underlying disk hardware is a simple hard drive. For instance, the data could be striped, there could be multiple controllers, etc. The amount of parallelism that makes sense can probably only be discovered empirically. – frankc Aug 16 '12 at 19:52
  • @frankc: Unless the underlying media is an SSD with reads equal to its block size, there will be seek issues. – Loren Pechtel Aug 16 '12 at 20:27
  • @TechGeeky: If you don't write the decompressed files out, their size doesn't matter. Find a library that can read the files into memory (assuming they aren't too big; 10k files in 40 GB of data means an average of 4 MB/file, which won't be any problem unless the distribution is weird). I have done decompress-to-memory back in the DOS days, and on modern stuff I have written code that creates and sends a zip file without writing ANYTHING to disk. I would be surprised if there isn't a way to do it. – Loren Pechtel Aug 16 '12 at 20:32
  • @loren seek issues don't mean no speedup from parallelism, though. I think the level of parallelism that produces a speedup has to be discovered empirically. – frankc Aug 17 '12 at 15:34
  • @frankc: My experience is that whenever you try to run two disk operations in parallel they take longer than they would if run sequentially. – Loren Pechtel Aug 17 '12 at 20:33

For starters, you will need to uncompress the file to disk.

This does work (in bash), but you probably don't want to try to start 10,000 processes all at once. Run it inside the directory of uncompressed files:

for i in `find . -type f`; do ( (grep 'b295ed051380a47a2f65fb75ff0d7aa7^]3^]-1' "$i") & ); done

So, we need to have a way to limit the number of spawned processes. This will loop as long as the number of grep processes running on the machine exceeds 10 (including the one doing the counting):

while [ `top -b -n1 | grep -c grep` -gt 10  ]; do echo true; done

I have run this, and it works... but top takes so long to run that it effectively limits you to one grep per second. Can someone improve upon this, incrementing a count when a new process is started and decrementing it when one ends?

for i in `find . -type f`; do ( (grep -l 'blah' "$i") & ); (while [ `top -b -n1 | grep -c grep` -gt 10 ]; do sleep 1; done); done

Any other ideas for how to determine when to sleep and when not to? Sorry for the partial solution, but I hope someone has the other bit you need.
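
One way to avoid polling top at all (a sketch along the same lines, untested on SunOS) is to launch the greps in fixed-size batches and let the shell's own wait builtin block until each batch has finished:

n=0
for i in `find . -type f`; do
    grep -l 'blah' "$i" &        # run this grep in the background
    n=$((n + 1))
    if [ $n -ge 10 ]; then
        wait                     # block until the current batch of 10 greps has exited
        n=0
    fi
done
wait                             # catch the final, possibly partial, batch

The trade-off is that each batch runs only as fast as its slowest grep, but it avoids spawning a new top process for every file.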

Sniggerfardimungus

If you are not using regular expressions, you can use the -F option of grep or use fgrep, which match fixed strings rather than patterns. This may give you additional performance.
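
For example, the question's pipeline rewritten to use fgrep for a fixed-string search:

gzcat /data/newfolder/real-time-newdata/*_20120809_0_*.gz | fgrep 'b295ed051380a47a2f65fb75ff0d7aa7^]3^]-1'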

PaulB

Your gzcat ... | wc -l does not indicate 10,000 files; it indicates 10,000 lines in total across however many files there are.

This is the type of problem that xargs exists for. Assuming your version of gzip came with a script called gzgrep (or maybe just zgrep), you can do this:

find /data/newfolder/real-time-newdata -type f -name "*_20120809_0_*.gz" -print | xargs gzgrep 'b295ed051380a47a2f65fb75ff0d7aa7^]3^]-1'

That will run gzgrep with batches of as many individual files as it can fit on a command line (there are options to xargs to limit how many files go into each invocation, or to control a number of other things; see the example below). Unfortunately, gzgrep still has to uncompress each file and pass it off to grep, but there's not really any good way to avoid having to uncompress the whole corpus in order to search through it. Using xargs in this way will, however, cut down some on the overall number of new processes that need to be spawned.
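
For example, assuming gzgrep (or zgrep) is on the PATH, the -n option to xargs caps how many files each invocation receives (50 here is an arbitrary choice):

find /data/newfolder/real-time-newdata -type f -name "*_20120809_0_*.gz" -print | xargs -n 50 gzgrep 'b295ed051380a47a2f65fb75ff0d7aa7^]3^]-1'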

twalberg