Python's difflib SequenceMatcher speed up

Question

I'm using difflib's SequenceMatcher (the ratio() method) to measure similarity between text files. difflib is fast enough for a small set of files, but it does not scale: comparing 10 files of about 70 KB each against one another (46 comparisons) takes about 80 seconds.

The issue is that I have a collection of 3000 text files (75 KB on average), and a rough estimate of the time SequenceMatcher would need to compare them all is 80 days!
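The estimate can be reproduced with a back-of-the-envelope calculation. This is only a sketch: it assumes every comparison costs the same as the average one in the 10-file sample and that all unordered pairs are compared. The variable names are illustrative.

```python
from math import comb

# Figures reported in the question for the small sample.
sample_seconds = 80
sample_comparisons = 46
seconds_per_comparison = sample_seconds / sample_comparisons

# Every unordered pair of the 3000 files has to be compared once.
total_files = 3000
total_comparisons = comb(total_files, 2)

estimated_days = total_comparisons * seconds_per_comparison / 86_400
print(f"{total_comparisons:,} comparisons, roughly {estimated_days:.0f} days")
```

Under these assumptions the result is on the order of three months, the same order of magnitude as the figure above. The number of comparisons grows quadratically with the number of files, so the per-comparison cost is what needs to come down.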

I tried the real_quick_ratio() and quick_ratio() methods, but they don't fit our needs.

Is there any way to speed up the comparison process? If not, is there a faster method for this kind of task, even one that is not in Python?

andres.riancho · Answer 1 · 2018-06-13T21:21:33.303

The issue you're finding is very common, since difflib is not optimized. Here are some tricks I've found over the years while developing a tool that compares HTML documents.

Files fit in memory

Create two lists containing the lines from each file, then call difflib.SequenceMatcher with the lists as parameters. SequenceMatcher knows how to handle lists, and the process will be much faster since it works line by line instead of character by character. This might reduce precision: two lines that differ by a single character count as completely different.

Take a look at fuzzy_string_cmp.py and diff.py to see how I'm doing exactly this.
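A minimal sketch of the line-based comparison, using only the standard library. The function name line_ratio is illustrative and is not taken from fuzzy_string_cmp.py or diff.py.

```python
import difflib


def line_ratio(text_a: str, text_b: str) -> float:
    """Similarity in [0, 1], computed over lines instead of characters."""
    lines_a = text_a.splitlines()
    lines_b = text_b.splitlines()
    # SequenceMatcher accepts any sequences of hashable items, so lists
    # of strings work; each line is treated as a single element.
    matcher = difflib.SequenceMatcher(None, lines_a, lines_b)
    return matcher.ratio()


if __name__ == "__main__":
    old = "alpha\nbeta\ngamma\n"
    new = "alpha\nbeta\ndelta\n"
    print(line_ratio(old, new))
```

For files on disk, read each file once and cache the list of lines, so that the splitting is not repeated for every pair.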

Alternative

There is a great library called diff_match_patch, which is available on PyPI. The library performs fast diffs between two strings and returns the changes (text added, text equal, text removed).

By leveraging diff_match_patch you should be able to create your own dmp_quick_ratio function.

In diff.py you can see how I'm using the library to get inspiration for creating dmp_quick_ratio.

My tests showed that using diff_match_patch was 20 times faster than Python's difflib.
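One possible shape for dmp_quick_ratio is sketched below. It is an assumption on my part, not the code from diff.py: it takes the (operation, text) tuples returned by diff_main and applies the same 2 * matches / total formula that SequenceMatcher.ratio() uses. The helper name ratio_from_diffs is hypothetical, and the scores will not be identical to difflib's because the two libraries align text differently.

```python
DIFF_EQUAL = 0  # value diff_match_patch uses for unchanged text


def ratio_from_diffs(diffs):
    """Turn a list of (op, text) diff tuples into a 0..1 similarity."""
    equal = sum(len(text) for op, text in diffs if op == DIFF_EQUAL)
    # Equal text is present in both inputs; inserted or deleted text
    # is present in only one of them.
    total = sum(len(text) * (2 if op == DIFF_EQUAL else 1)
                for op, text in diffs)
    return 2.0 * equal / total if total else 1.0


def dmp_quick_ratio(text_a, text_b):
    # Third-party dependency, imported lazily: pip install diff-match-patch
    from diff_match_patch import diff_match_patch

    dmp = diff_match_patch()
    return ratio_from_diffs(dmp.diff_main(text_a, text_b))
```

Keeping the scoring separate from the diff call makes it easy to test the formula without the library installed and to swap in a different scoring rule later.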

score 6 · Answer 2 · answered Mar 18 '21 at 15:30

There is a C implementation of difflib.SequenceMatcher, cdifflib.

Replace difflib's SequenceMatcher with it, and all difflib operations will be about 4x faster:

from cdifflib import CSequenceMatcher
import difflib
difflib.SequenceMatcher = CSequenceMatcher
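If patching the difflib module globally is not desirable, the C class can also be used directly. The sketch below falls back to the pure-Python class when cdifflib is not installed; the Matcher alias and the similarity helper are illustrative names, and the sketch assumes CSequenceMatcher accepts the same constructor arguments as difflib.SequenceMatcher.

```python
try:
    # Third-party C implementation: pip install cdifflib
    from cdifflib import CSequenceMatcher as Matcher
except ImportError:
    # Fall back to the standard library if cdifflib is unavailable.
    from difflib import SequenceMatcher as Matcher


def similarity(a, b):
    """Return the ratio() score for two sequences."""
    return Matcher(None, a, b).ratio()


if __name__ == "__main__":
    print(similarity("abcd", "bcde"))
```

Note that the module-level patch only affects code that looks up difflib.SequenceMatcher after the assignment; modules that already ran "from difflib import SequenceMatcher" keep the original class.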

greggmi

score -6 · Answer 3 · answered Jun 30 '15 at 07:32

You can get a small speedup by running the code under PyPy:

http://pypy.org/

ark

This recommendation is too generic. – andres.riancho May 07 '18 at 19:22
