Say I have 10 unordered lists of 100 string elements in each. What's the fastest way to find which lists have a high degree of overlap (e.g. 50%+) with another list or lists, and which list(s) they overlap with?

What if we scaled it up to 1,000,000,000 unordered lists of 10,000 strings each? What's the most efficient way to identify overlapping lists at that scale?

2 Answers

This is a slow operation. You would create a set from each of your lists, then compare one list against all the others, keeping an overlap score in a hashtable or something similar, then move on to the next list and repeat. It's very slow and would not scale well, but depending on the domain you're working in, there may be algorithms (and data structures) specifically tailored to that operation, for example fuzzy searching and string matching. Your question is too broad. What is it specifically that you're attempting to do?
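As a minimal sketch of that naive pairwise approach (the 50% threshold comes from the question; the choice to measure overlap against the smaller list is an assumption):

```python
def find_overlapping_lists(lists, threshold=0.5):
    """Return (i, j, overlap_fraction) for every pair of lists that
    share at least `threshold` of their elements."""
    sets = [set(lst) for lst in lists]  # build a set from each list once
    results = []
    for i in range(len(sets)):
        for j in range(i + 1, len(sets)):
            shared = len(sets[i] & sets[j])
            # overlap measured relative to the smaller of the two lists
            frac = shared / min(len(sets[i]), len(sets[j]))
            if frac >= threshold:
                results.append((i, j, frac))
    return results

# Example: the third list shares 2 of its 3 items with the first.
lists = [["a", "b", "c"], ["x", "y", "z"], ["a", "b", "q"]]
print(find_overlapping_lists(lists))  # [(0, 2, 0.666...)]
```

For 10 lists of 100 strings this finishes instantly, but the number of comparisons grows quadratically with the number of lists, which is why it won't scale to 1,000,000,000 lists.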

kobi7
  • That makes sense. I was mainly wondering if it was possible to do this type of comparison in a way that's faster than quadratic time. I agree that the question is broad, but that's because this was an abstract question my friend asked me. I don't have a specific set of docs that I'm trying to analyze. Thanks! – user7418754 Jan 20 '17 at 06:26
If you want to find the similarities between two documents, you should take a look at TfidfVectorizer. Can you provide us with some sample lists or documents and the desired output?
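For instance, a minimal sketch using scikit-learn (the sample lists are made up, and joining each list into one whitespace-separated "document" is an assumption about the input format):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical sample lists of strings.
lists = [
    ["apple", "banana", "cherry"],
    ["banana", "cherry", "date"],
    ["fig", "grape", "kiwi"],
]

# Treat each list as one whitespace-joined "document".
docs = [" ".join(lst) for lst in lists]

# Sparse document-term matrix of TF-IDF weights.
tfidf = TfidfVectorizer().fit_transform(docs)

# Pairwise cosine similarity; sim[i][j] near 1.0 means high overlap.
sim = cosine_similarity(tfidf)
print(sim.round(2))
```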

NinjaGaiden
  • That's helpful, thanks! This is also just an abstract question my friend asked me, so I don't have any real data I'm trying to work with here. – user7418754 Jan 20 '17 at 06:25