
I'm struggling with a performance problem. The task at hand is to compute the similarity between two strings. For this I am using fuzzywuzzy:

from fuzzywuzzy import fuzz

print fuzz.ratio("string one", "string two")  # 80
print fuzz.ratio("string one", "string two which is significantly different")  # 38

So far this is OK. The problem I'm facing is that I have two lists, one with 1500 rows and the other with several thousand, and I need to compare every element of the first against every element of the second. A simple nested for loop would take a ridiculous amount of time to compute.
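For reference, the naive all-pairs loop looks like the sketch below (Python 3 syntax). It uses difflib.SequenceMatcher from the standard library, which fuzz.ratio is built on, so it runs even without fuzzywuzzy installed; the lists here are small stand-ins for the real data:

```python
from difflib import SequenceMatcher

def ratio(a, b):
    # same 0-100 scale as fuzz.ratio
    return int(round(100 * SequenceMatcher(None, a, b).ratio()))

statements = ["string one", "another phrase"]  # stand-in for the 1500 rows
tweets = ["string one", "string two which is significantly different"]

# naive O(len(statements) * len(tweets)) comparison
scores = [[ratio(s, t) for t in tweets] for s in statements]
print(scores[0][0])  # 100 for the identical pair
```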

If anyone has a suggestion how can I speed this up, it would be highly appreciated.

VnC
    If you literally have to individually compare each element against each other element, there is no way around the expensive O(n^2) double-for-loop operation you are concerned about. However, we may be able to help you with optimizations if you give more information about the problem you are trying to solve, the type of the elements involved, and why you feel you have to compare each element. – Chris Redford Aug 21 '16 at 17:09
  • The idea is to get a count of how many times each of these 1500 statements appear in a list of tweets (which contains several thousand entries). – VnC Aug 21 '16 at 17:28

4 Answers


I've put something together for you (Python 2.7):

from __future__ import division

import time

from fuzzywuzzy import fuzz


one = "different simliar"
two = "similar"


def compare(first, second):
    smaller, bigger = sorted([first, second], key=len)

    s_smaller = smaller.split()
    s_bigger = bigger.split()
    bigger_sets = [set(word) for word in s_bigger]

    counter = 0
    for word in s_smaller:
        if set(word) in bigger_sets:
            counter += len(word)
    if counter:
        return counter/len(' '.join(s_bigger))*100 # percentage match
    return counter


start_time = time.time()
print "match: ", compare(one, two)
compare_time = time.time() - start_time
print "compare: --- %s seconds ---" % (compare_time)
start_time = time.time()
print "match: ", fuzz.ratio(one, two)
fuzz_time = time.time() - start_time
print "fuzzy: --- %s seconds ---" % (fuzz_time)
print
print "<simliar or similar>/<length of bigger>*100%"
print 7/len(one)*100
print
print "Equals?"
print 7/len(one)*100 == compare(one, two)
print
print "Faster than fuzzy?"
print compare_time < fuzz_time
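The reason this matches transposition typos is that set() throws away character order (and repeats), so the misspelling and the correct word compare equal. The flip side, worth knowing, is that true anagrams also compare equal, a false positive that fuzz.ratio would not produce (Python 3 syntax here):

```python
# order is discarded, so the transposition typo matches
print(set("simliar") == set("similar"))  # True

# ...but so do genuine anagrams, which fuzz.ratio would score below 100
print(set("listen") == set("silent"))    # True
```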

So mine is faster; whether it's accurate enough for you is your call.

EDIT: It's now not only faster, but also more accurate.

Result:

match:  41.1764705882
compare: --- 4.19616699219e-05 seconds ---
match:  50
fuzzy: --- 7.39097595215e-05 seconds ---

<simliar or similar>/<length of bigger>*100%
41.1764705882

Equals?
True

Faster than fuzzy?
True

Of course, if you want a word-level check like fuzzywuzzy does, then here you go:

from __future__ import division
import time

from fuzzywuzzy import fuzz


one = "different simliar"
two = "similar"


def compare(first, second):
    smaller, bigger = sorted([first, second], key=len)

    s_smaller = smaller.split()
    s_bigger = bigger.split()
    bigger_sets = [set(word) for word in s_bigger]

    counter = 0
    for word in s_smaller:
        if set(word) in bigger_sets:
            counter += 1
    if counter:
        return counter/len(s_bigger)*100 # percentage match
    return counter


start_time = time.time()
print "match: ", compare(one, two)
compare_time = time.time() - start_time
print "compare: --- %s seconds ---" % (compare_time)
start_time = time.time()
print "match: ", fuzz.ratio(one, two)
fuzz_time = time.time() - start_time
print "fuzzy: --- %s seconds ---" % (fuzz_time)
print
print "Equals?"
print fuzz.ratio(one, two) == compare(one, two)
print
print "Faster than fuzzy?"
print compare_time < fuzz_time

Result:

match:  50.0
compare: --- 7.20024108887e-05 seconds ---
match:  50
fuzzy: --- 0.000125169754028 seconds ---

Equals?
True

Faster than fuzzy?
True
turkus
    I really appreciate the effort, but this is not exactly what I want. Say that you've got two strings "similar" and "different simliar" (there is an intentional spelling mistake) your example wouldn't even return an output, while `fuzzywuzzy` outputs 50% similarity. – VnC Aug 21 '16 at 18:57
  • @VnC I think the second algorithm will meet your criteria. – turkus Aug 21 '16 at 20:41

If you need to count the number of times each of the statements appears, then no, there is no way I know of to get a huge speedup over the n^2 comparisons needed between the two lists. You might be able to skip some string comparisons by using the lengths to rule out pairs that could never match, but you would still have nested for loops. You would probably spend far more time optimizing it than the amount of processing time it would save you.
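The length pruning mentioned above can be sketched as follows (Python 3, with difflib standing in for fuzz.ratio and hypothetical helper names). It relies on the fact that the ratio is 2*M/(len(a)+len(b))*100, where the number of matched characters M can never exceed the shorter string's length:

```python
from difflib import SequenceMatcher

def ratio(a, b):
    return 100.0 * SequenceMatcher(None, a, b).ratio()

def best_possible(a, b):
    # ratio is 2*M/(len(a)+len(b))*100 and M can never exceed the
    # shorter length, so this is an upper bound on ratio(a, b)
    shorter, total = min(len(a), len(b)), len(a) + len(b)
    return 200.0 * shorter / total

def filtered_matches(statements, tweets, threshold=70):
    hits = []
    for s in statements:
        for t in tweets:
            if best_possible(s, t) < threshold:
                continue            # lengths alone rule this pair out
            if ratio(s, t) >= threshold:
                hits.append((s, t))
    return hits
```

This saves the expensive similarity call for pairs whose lengths are too far apart, but, as the answer says, the loop is still nested.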

qfwfq
  • I think it's possible @jcolemang, check my solution: http://stackoverflow.com/questions/39066655/all-to-all-comparison-of-two-lists-in-python/39067506#39067506 – turkus Aug 21 '16 at 20:22
  • @turkus what I meant in my answer was that you can't get a time complexity improvement over n^2 (which I should have worded better). I believe your answer is showing how to improve the individual comparisons and not an algorithm improving on how to match the two lists together. – qfwfq Aug 21 '16 at 21:07

The best solution I can think of is to use the IBM Streams framework to parallelize your essentially unavoidable O(n^2) solution.

Using the framework, you would be able to write a single-threaded kernel similar to this

def matchStatements(tweet, statements):
    results = []
    for s in statements:
        r = fuzz.ratio(tweet, s)
        results.append(r)
    return results

Then parallelize it using a setup similar to this

from streamsx.topology.topology import Topology

def main():
    topo = Topology("tweet_compare")
    source = topo.source(getTweets)
    cpuCores = 4
    # bind `statements` first, e.g. functools.partial(matchStatements, statements=statements)
    match = source.parallel(cpuCores).transform(matchStatements)
    end = match.end_parallel()
    end.sink(print)

This multithreads the processing, speeding it up substantially while saving you the work of implementing the details of the multithreading yourself (which is the primary advantage of Streams).

The idea is that each tweet is a Streams tuple to be processed across multiple processing elements.
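If pulling in Streams is too heavy a dependency, the same fan-out can be sketched with the standard library's multiprocessing module (Python 3; difflib stands in for fuzz.ratio so the sketch is self-contained, and the function names and worker count are illustrative):

```python
import multiprocessing
from difflib import SequenceMatcher
from functools import partial

def match_statements(statements, tweet):
    # score one tweet against every statement (the per-worker kernel)
    return [round(100 * SequenceMatcher(None, tweet, s).ratio())
            for s in statements]

def match_all(tweets, statements, workers=4):
    # each worker process scores a share of the tweets
    with multiprocessing.Pool(workers) as pool:
        return pool.map(partial(match_statements, statements), tweets)
```

Each row of the result holds one tweet's scores against all statements, so counting matches above a threshold is a simple second pass.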

The Python topology framework documentation for Streams is here and the parallel operator in particular is described here.

Chris Redford

You can convert the columns to lists using column_name.tolist() and assign them to variables.

There is a python package called two-lists-similarity which compares the lists of two columns and computes a score.

https://pypi.org/project/two-lists-similarity/

sirishp