
I have a large number of fixed strings (~5 million) that I want to search for in a lot of files.

I saw that two of the most commonly used algorithms for string searching with a finite set of patterns are Aho-Corasick and Commentz-Walter.

My goal is to find exact matches, not patterns (that is, the list of strings does not contain regular expressions).

After some research, I found several articles stating that Commentz-Walter tends to be faster than Aho-Corasick in real-world scenarios (Article1, Article2), and that it is also the algorithm behind GNU grep.

I tried to use grep -F, also in parallel (taken from here):

free=$(awk '/^((Swap)?Cached|MemFree|Buffers):/ { sum += $2 }
          END { print sum }' /proc/meminfo)
percpu=$((free / 200 / $(parallel --number-of-cores)))k
parallel --pipepart -a regexps.txt --block $percpu --compress \
grep -F -f - -n bigfile

and it seems that the problem is too big. I get this error:

grep: memory exhausted
  1. I thought about splitting the pattern list into a number of files and running grep several times on the same file, but that seems clumsy. Is there another solution, or am I simply not running grep in the correct way?
  2. To run the Commentz-Walter algorithm, grep has to do some pre-processing work. I assume that running grep with the same pattern file on two different files causes grep to execute the same pre-processing stage twice. Is there a way to run grep on a list of files so that the pattern pre-processing is done only once? (See the example after this list.)
  3. Is there any good implementation of Commentz-Walter in C/C++? I only found code in Python (here).
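
For reference, grep accepts several file operands in a single invocation, so the pattern pre-processing is done once per run; a minimal illustration (the file names are placeholders):

grep -F -n -f patterns.txt log1.txt log2.txt log3.txt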

--- Update ---

Following some of the comments, I tried to test different Aho-Corasick C/C++ implementations (Komodia, Cjgdev, chasan); none of them could handle the 5-million-pattern example set (all of them had memory issues: segmentation fault / stack overflow), although they do work on small sets. The example file was generated by this code:

import uuid

with open(r"C:\very_large_pattern", 'w') as out:
    for _ in range(0, 5000000):
        out.write(str(uuid.uuid4()) + '\n')

Does anybody have a suggestion for an implementation that can handle those numbers?

sborpo
  • Can you post some samples from the "fixed strings" and the "files"? – James Brown May 09 '18 at 15:00
  • Yeah, some example content of the input list and input files would help, e.g. matching whole line, word, partial match, etc. – Sundeep May 09 '18 at 15:21
  • The first article says that CW is faster than AC, but doesn't provide any data. The second article presents a graph (Fig. 8) that it says shows time, but the Y axis is labeled "Memory occupation." So I'm not fully convinced. That said, the Aho-Corasick algorithm is easy to implement. I used a pretty naive implementation that managed to match incoming network traffic in real time against 10 million strings. It worked quite well. I would suggest using one of the many available Aho-Corasick implementations, which you can get running in an afternoon. If that's too slow, then look for C-W. – Jim Mischel May 09 '18 at 21:56
  • It's part of a large proprietary backup system. The fixed strings are hashes of files together with their specific metadata (like modified date etc.), and I need to find those strings in log files (which are often unformatted, so assume I am searching plain text), so what I need is an exact match (the pattern is not a regular expression) – sborpo May 10 '18 at 05:03
  • The mischasan A-C implementation claims 2-3 bytes per pattern byte, as well as a data structure that can be shared with IPC mechanisms, so it sounds ideal and should work in theory. I would first of all contact the author, and then try to fix it myself (perhaps all you need to do is increase the default stack size with a linker switch). Failing that -- file hashes have a very distinctive pattern (`[0-9a-f]{32}` or similar), so if these log files don't contain numerous hashes *of other things*, just parse them all out and either (a) look up each in hashtable of patterns or (b) sort and merge. – j_random_hacker May 10 '18 at 16:05
  • Even if the files are unformatted, the words to search for are probably isolated (surrounded by whitespace or non-word characters). So you would not need a string search; you could just tokenize the files and check whether each token is in the set of 5M strings (a sketch of this appears after the comments). If performance is an issue, you should try out the Backward-Oracle-Matching algorithm; it is more efficient, particularly for long patterns and a large set of patterns (though I do not know about its memory consumption). – CoronA May 14 '18 at 04:29
  • I could imagine if all search strings are of the same length (hashes I guess are), then this could be a relevant factor for optimizing the algorithm. – Alfe Mar 26 '19 at 12:56
  • Have you tried [ripgrep](https://github.com/BurntSushi/ripgrep)? – Slaiyer Jan 08 '20 at 21:48
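
A minimal sketch of the tokenize-and-look-up idea from the comments above, assuming the hashes appear in the logs as whitespace-separated tokens; patterns.txt and the log file names are placeholders:

# Load the 5M fixed strings into an in-memory set, then report every
# whitespace-separated token of the log files that is in that set.
awk 'NR == FNR { patterns[$0]; next }
     {
         for (i = 1; i <= NF; i++)
             if ($i in patterns)
                 print FILENAME ":" FNR ": " $i
     }' patterns.txt log1.txt log2.txt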

1 Answer


Here is a simple solution that should be fast.

Put your fixed-length strings to search, one per line, into a file, and sort the file. Call that file S.

For each file that you want to search, do:

  1. If the length of the strings to search is k, break the file into every possible string of length k. Call that file B. For example, if k = 5 and the file to search is:

    abcdefgh
    123
    123456
    

    The file of broken strings would be:

    abcde
    bcdef
    cdefg
    defgh
    12345
    23456
    

    Now, in order to know the position of each broken string in the original file, append its line and column numbers to each entry of file B. For example,

    abcde 1 1
    bcdef 1 2
    cdefg 1 3
    defgh 1 4
    12345 3 1
    23456 3 2
    
  2. Sort B, and merge it with S. Call the resulting file M. For example, if S is:

    23456
    cdefg
    

    M would be:

    12345 3 1
    23456
    23456 3 2
    abcde 1 1
    bcdef 1 2
    cdefg
    cdefg 1 3
    defgh 1 4
    
  3. Retrieve from M all occurrences of the strings of S found in the file. For example:

    23456
    23456 3 2
    cdefg
    cdefg 1 3
    

    If a string has multiple occurrences, you can get all of them.

I do not know what OS you work with, but the above steps can most likely be performed with commands like sort, awk, grep, etc.
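
For illustration, here is a minimal sketch of these steps using sort, awk, and join, assuming all patterns have the same length k (36 for the UUID strings above); it uses join in place of a literal merge, so steps 2 and 3 happen in one pass. patterns.txt and input.txt are placeholder names:

k=36                                  # assumed pattern length (a UUID string is 36 characters)
LC_ALL=C sort patterns.txt > S.txt    # file S: the sorted fixed strings

# Step 1: emit every substring of length k together with its line and column.
awk -v k="$k" '{
    for (i = 1; i + k - 1 <= length($0); i++)
        print substr($0, i, k), NR, i
}' input.txt > B.txt

# Steps 2 and 3: sort the broken strings (the keys have a fixed length, so a
# whole-line sort orders them by key) and keep only those whose key is in S.
LC_ALL=C sort B.txt > B_sorted.txt
LC_ALL=C join S.txt B_sorted.txt      # prints: matched-string line column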

RobertBaron