I have around 10,000 text files, and quite a lot of them have very similar content. I am trying to get rid of the files that are very similar to each other so that I am left with a smaller, more unique set. For reference, the contents of a file can be several pages long.
I am trying to solve this by measuring the string distance between the files' contents, specifically their Levenshtein distance. To get some quick gains I have tried a few ways to reduce the number of comparisons, such as only comparing files of similar size and similar text length.
import itertools
import os
import shutil
import time

from fuzzywuzzy import fuzz  # or: from rapidfuzz import fuzz

# text_directory is the folder holding the ~10,000 text files (defined earlier in my script)
text_files = {}
for item in os.listdir(text_directory):
    text_files[item] = os.path.getsize(os.path.join(text_directory, item))

count = 0

def Find_Similar_Text(text_files, count):
    tic = time.process_time()
    for a, b in itertools.combinations(list(text_files), 2):
        # skip pairs whose first file has already been moved away
        if a not in text_files or b not in text_files:
            continue
        # quick filter: only compare files whose sizes are within 50 bytes of each other
        if text_files[a] - 50 < text_files[b] < text_files[a] + 50:
            with open(os.path.join(text_directory, a), 'rb') as file1:
                file1_data = file1.read()
            with open(os.path.join(text_directory, b), 'rb') as file2:
                file2_data = file2.read()
            # second filter: contents must be within 100 bytes of each other in length
            if -100 < len(file1_data) - len(file2_data) < 100:
                ratio = fuzz.ratio(file1_data, file2_data)
                if ratio > 70:
                    count += 1
                    print(count, 'Ratio:', ratio, a, text_files[a], 'bytes', b, text_files[b], 'bytes')
                    shutil.move(os.path.join(text_directory, a), os.path.join(text_directory, 'SimilarFiles'))
                    text_files.pop(a)
    toc = time.process_time()
    print('Elapsed time:', toc - tic)
    Find_Similar_Text(text_files, count)  # restart over the remaining files

Find_Similar_Text(text_files, count)
I know that this will currently run into an endless loop at the end of the process because of the unconditional recursive call, but I'm still quite far from crossing that bridge.
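When I do get to that bridge, my current thinking is to replace the recursion with a plain loop that repeats passes until a pass moves no more files. Below is a minimal sketch of that idea, not something I have settled on: one_pass is just a name I made up here, it reuses the same size, length, and ratio thresholds as above, and it assumes the SimilarFiles folder already exists and that text_files and text_directory are set up as in the code above.

import itertools
import os
import shutil

from fuzzywuzzy import fuzz  # or: from rapidfuzz import fuzz

def one_pass(text_files, text_directory):
    """Run one round of pairwise comparisons; return how many files were moved."""
    moved = 0
    for a, b in itertools.combinations(list(text_files), 2):
        # a may already have been moved earlier in this pass
        if a not in text_files or b not in text_files:
            continue
        # same quick size filter as above (sizes within 50 bytes of each other)
        if abs(text_files[a] - text_files[b]) >= 50:
            continue
        with open(os.path.join(text_directory, a), 'rb') as f:
            data_a = f.read()
        with open(os.path.join(text_directory, b), 'rb') as f:
            data_b = f.read()
        # same length filter and similarity threshold as above
        if abs(len(data_a) - len(data_b)) < 100 and fuzz.ratio(data_a, data_b) > 70:
            # assumes the SimilarFiles folder already exists inside text_directory
            shutil.move(os.path.join(text_directory, a), os.path.join(text_directory, 'SimilarFiles'))
            text_files.pop(a)
            moved += 1
    return moved

# keep making passes until a whole pass moves nothing, instead of recursing forever
while one_pass(text_files, text_directory) > 0:
    pass

The loop stops as soon as a full pass moves nothing, so it cannot spin forever the way the unconditional recursive call does.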