
I have a sorted list of 1,000,000 protein names (strings of at most 256 characters), each with an associated ID. I also have an unsorted list of 4,000,000,000 words taken from articles, again at most 256 characters each, and every word has an ID.

I want to find all matches between the list of protein names and the list of article words. Which algorithm should I use? Should I use some prebuilt API?

It would be good if the algorithm could run on a normal PC without special hardware.

An estimate of the time the algorithm requires would be nice, but is not obligatory.

Christian

5 Answers

1

4 billion strings is a lot of strings to search.

You may be able to fit the entire data structure into an in-memory hash for fast lookup, but more likely you'd want to store the list on more spacious (but slower) disk, in which case a sorted list lends itself to the relatively efficient binary search algorithm.

If your binary search (or similar) function were called find_string_in_articles(), the pseudocode would be:

foreach $protein_name ( @protein_names ) {
    # find_string_in_articles() returns a matching article ID, or false
    if ( $article_id = find_string_in_articles( $protein_name ) ) {
        print( "$protein_name matches $article_id\n" );
    }
}
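
If the protein list does fit in memory (1,000,000 names of at most 256 bytes is well under a gigabyte), a minimal sketch of the in-memory hash variant could look like the following; the file names and the tab-separated layout are assumptions:

proteins = {}                                  # name -> ID, fits easily in RAM
with open("proteins.txt") as f:                # assumed format:
    for line in f:                             # tab-separated "name<TAB>id" lines
        name, pid = line.rstrip("\n").split("\t")
        proteins[name] = pid

with open("article_words.txt") as f:           # stream the huge list, never load it whole
    for line in f:
        word, wid = line.rstrip("\n").split("\t")
        if word in proteins:                   # O(1) average-case hash lookup
            print(word, proteins[word], wid)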
PP.
  • Most search algorithms on disk storage are horrendous performance-wise. Swap the collections so you can do the lookup in memory on the proteins, and sequentially scan the article words. – Simon Buchan Apr 01 '10 at 00:23
1

You could sort both lists and then do a "mergesort"-style pass that does not actually merge, but finds the duplicates/overlaps for you. Wikipedia has good references on that.
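
A minimal sketch of such a merge-style overlap scan, assuming each input is an iterator of (string, id) pairs sorted by string, and that protein names are unique:

def find_overlaps(sorted_proteins, sorted_words):
    # Both arguments are iterators of (string, id) pairs sorted by string;
    # protein names are assumed unique, article words may repeat.
    p = next(sorted_proteins, None)
    w = next(sorted_words, None)
    while p is not None and w is not None:
        if p[0] == w[0]:
            yield p, w                    # overlap found
            w = next(sorted_words, None)  # a repeated word re-matches the same protein
        elif p[0] < w[0]:
            p = next(sorted_proteins, None)
        else:
            w = next(sorted_words, None)

Called with iter() over the two sorted streams, this runs in a single linear pass, so both inputs can be read straight from disk.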

Sorting that amount of data probably requires more memory than you have available. I don't know whether Unix sort (also available on Windows and Mac) can handle it, but any decent SQL database can.

Another possibility is to use a radix tree on your protein names (those starting with A go to bin A, B to bin B, etc.). Then just loop over the 4 billion words and locate overlaps (you will probably have to implement more than one level of radix binning to discard more proteins at a time).
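
To illustrate one level of that binning (protein_names and article_words are hypothetical stand-ins; with everything in RAM this is no better than a single flat set, and the binning only pays off once individual bins have to be processed separately):

from collections import defaultdict

protein_names = ["ACTB", "TP53"]             # stand-in data
article_words = ["the", "TP53", "protein"]   # stand-in data

# One level of radix binning on the first character.
bins = defaultdict(set)
for name in protein_names:
    bins[name[0]].add(name)

for word in article_words:
    if word and word in bins.get(word[0], ()):   # cheap first-character filter, then set lookup
        print("match:", word)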

Pasi Savolainen
1

This is essentially a relational join. Assuming the article words are not already sorted, your basic algorithm should be:

for word in article_words:
    if proteins.find(word):   # membership test against the protein name list
        found_match(word)

proteins.find() is the difficult part, and you will have to experiment to get the best performance; this is the sort of problem where cache effects start to come into play. I would first try a radix sort, since it's pretty simple and likely fast enough, but binary searching and hashing are also alternatives.
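
As one concrete option for proteins.find(), a binary-search sketch over the already-sorted protein list (roughly 20 string comparisons per lookup for 1,000,000 names):

import bisect

# A possible proteins.find(): binary search over the already-sorted list
# (the question says the protein list is sorted).
def make_finder(sorted_names):
    def find(word):
        i = bisect.bisect_left(sorted_names, word)
        return i < len(sorted_names) and sorted_names[i] == word
    return find

find = make_finder(["ACTB", "TP53"])   # stand-in list; use the real sorted names
print(find("TP53"))                    # True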

Simon Buchan
0

Sounds like something you should use a binary tree for, maybe.

Hobblin
0

I would go about this in one of two ways.

  1. Insert both lists into an SQL database and query for the matches (slower, but easier; see the sketch below)
  2. Sort the list, then do binary searches to find what you need (faster, but trickier)
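
For option 1, a minimal sketch with SQLite (the table layout and database file name are assumptions, and the bulk loading is elided):

import sqlite3

con = sqlite3.connect("matches.db")    # file name is an assumption
con.execute("CREATE TABLE IF NOT EXISTS proteins (name TEXT, id INTEGER)")
con.execute("CREATE TABLE IF NOT EXISTS words (word TEXT, id INTEGER)")
# ... bulk-load both lists here, e.g. with con.executemany() ...
con.execute("CREATE INDEX IF NOT EXISTS idx_protein_name ON proteins(name)")
for name, pid, wid in con.execute(
        "SELECT p.name, p.id, w.id"
        " FROM words AS w JOIN proteins AS p ON p.name = w.word"):
    print(name, pid, wid)
con.close()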
rook