I have a large number of fixed strings (~5 million) that I want to search for in a lot of files.
I saw that two of the most commonly used algorithms for searching for a finite set of patterns are Aho-Corasick and Commentz-Walter.
My goal is to find exact matches, not patterns (that is, the entries in the list are literal strings, not regular expressions).
After some research, I found a lot of articles stating that Commentz-Walter tends to be faster than Aho-Corasick in real-world scenarios (Article1, Article2), and it is also the algorithm behind GNU grep.
I tried to use grep -F, both directly and in parallel (taken from here):
free=$(awk '/^((Swap)?Cached|MemFree|Buffers):/ { sum += $2 }
END { print sum }' /proc/meminfo)
percpu=$((free / 200 / $(parallel --number-of-cores)))k
parallel --pipepart -a regexps.txt --block $percpu --compress \
grep -F -f - -n bigfile
and it seems that the problem is too big. I get this error:
grep: memory exhausted
- I thought about splitting the pattern list into a number of files and running grep once per pattern file against the same input file, but that seems clumsy. Is there any other solution, or am I not running grep correctly?
- To run the Commentz-Walter algorithm, grep has to do some pre-processing work. I assume that running grep with the same pattern file on two different files will make grep execute the same pre-processing stage twice. Is there a way to run grep on a list of files so that the pattern pre-processing runs only once?
- Is there any good implementation of Commentz-Walter in C/C++? I only found code in Python (here).
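To make the first two questions concrete, here is a minimal sketch of the "split the pattern list" idea, written in Python around the same `grep -F -f` call used above. The function names `write_pattern_chunks` and `grep_fixed` and the chunk size are hypothetical and chosen only for illustration. Each chunk is handed to a single grep invocation together with all target files, so grep reads a given chunk of patterns once per invocation rather than once per target file. Whether any particular chunk size avoids the "memory exhausted" error is an assumption that would have to be tested.

```python
import subprocess
from itertools import islice
from pathlib import Path


def write_pattern_chunks(pattern_file, chunk_size, out_dir):
    """Split pattern_file into files of at most chunk_size lines each."""
    out_dir = Path(out_dir)
    chunk_paths = []
    with open(pattern_file, encoding="utf-8") as src:
        while True:
            lines = list(islice(src, chunk_size))
            if not lines:
                break
            path = out_dir / f"chunk_{len(chunk_paths):05d}.txt"
            path.write_text("".join(lines), encoding="utf-8")
            chunk_paths.append(path)
    return chunk_paths


def grep_fixed(chunk_paths, target_files):
    """Run one `grep -F` per chunk over all target files; return unique output lines."""
    hits = []
    for chunk in chunk_paths:
        proc = subprocess.run(
            ["grep", "-F", "-H", "-n", "-f", str(chunk), "--",
             *map(str, target_files)],
            capture_output=True,
            text=True,
        )
        # grep exits with 0 on a match, 1 on no match, and >1 on an error.
        if proc.returncode > 1:
            raise RuntimeError(proc.stderr)
        hits.extend(proc.stdout.splitlines())
    # A line that matches patterns from two chunks is reported twice; drop repeats.
    return list(dict.fromkeys(hits))
```

A call would look like `grep_fixed(write_pattern_chunks("regexps.txt", 100000, "chunks"), ["bigfile"])`, assuming the `chunks` directory already exists. Each target file is still scanned once per chunk, which is the clumsy part of this approach.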
--- Update ---
Following some comments, I tested several Aho-Corasick C/C++ implementations (Komodia, Cjgdev, chasan). None of them could handle the 5 million pattern set: all of them had memory issues (segmentation fault / stack overflow), although they do work on small sets. The example file was generated by this code:
import uuid

with open(r"C:\very_large_pattern", 'w') as out:
    for _ in range(0, 5000000):
        out.write(str(uuid.uuid4()) + '\n')
Can anybody suggest an implementation that can handle those numbers?
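One direction that may be worth testing for this particular example, offered only as a sketch: every generated pattern is a `str(uuid.uuid4())` value, so all patterns have the same length (36 characters). Under that assumption, which does not hold for arbitrary pattern lists, an exact multi-pattern search can be done without an automaton by sliding a fixed-width window over the text and looking each window up in a hash set. This is neither Aho-Corasick nor Commentz-Walter, and the function names `load_patterns` and `find_fixed_length` are hypothetical. Whether a Python set of 5 million strings fits in memory on a given machine is untested here.

```python
def load_patterns(path):
    """Read one pattern per line into a set, skipping blank lines."""
    with open(path, encoding="utf-8") as f:
        return {line.rstrip("\n") for line in f if line.strip()}


def find_fixed_length(patterns, text):
    """Yield (offset, pattern) for each occurrence in text of any pattern.

    Assumes every pattern has the same length; raises ValueError otherwise.
    """
    lengths = {len(p) for p in patterns}
    if len(lengths) != 1:
        raise ValueError("all patterns must have the same length")
    (width,) = lengths
    # Check every window of the shared pattern width against the set.
    for i in range(len(text) - width + 1):
        window = text[i:i + width]
        if window in patterns:
            yield i, window
```

For a large file, the function could be called once per line, for example `for lineno, line in enumerate(f, 1): for offset, pat in find_fixed_length(patterns, line): ...`, so that the pattern set is built once and reused across all lines and files.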