
I have around 10,000 text files, and quite a lot of them have very similar content. I am trying to get rid of the files that are very similar to each other so that I am left with a smaller, more unique set. Just for reference, the contents of the text files can be several pages long.

I am trying to solve this by measuring the string distance of the contents via their Levenshtein distance. I have tried some ways to reduce the number of comparisons, such as only running comparisons on files of similar size and similar text length, just to get some quick gains.

import itertools
import os
import shutil
import time

from fuzzywuzzy import fuzz

# text_directory is assumed to be set elsewhere, e.g. text_directory = 'texts/'
text_files = {}
for item in os.listdir(text_directory):
    text_files[item] = os.path.getsize(os.path.join(text_directory, item))

def Find_Similar_Text(text_files, count):
    tic = time.process_time()
    for a, b in itertools.combinations(text_files, 2):
        # Pre-filter: only compare files whose sizes are within 50 bytes.
        if text_files[a] - 50 < text_files[b] < text_files[a] + 50:
            with open(os.path.join(text_directory, a), 'rb') as file1:
                file1_data = file1.read()
            with open(os.path.join(text_directory, b), 'rb') as file2:
                file2_data = file2.read()
            if -100 < len(file1_data) - len(file2_data) < 100:
                ratio = fuzz.ratio(file1_data, file2_data)
                if ratio > 70:
                    count += 1
                    print(count, 'Ratio:', ratio, a, text_files[a], 'bytes', b, text_files[b], 'bytes')
                    shutil.move(os.path.join(text_directory, a),
                                os.path.join(text_directory, 'SimilarFiles'))
                    text_files.pop(a)
                    toc = time.process_time()
                    print('Elapsed time:', toc - tic)
                    # Restart the scan now that the dict has changed.
                    Find_Similar_Text(text_files, count)

Find_Similar_Text(text_files, 0)

I know that currently this will run into an endless loop at the end of the process due to the recursive nature, but I'm still quite far from crossing that bridge.

  • What is your *specific* question? – Klaus D. Jun 13 '19 at 09:32
  • I'm looking for a faster method to compare the contents of the files. The one I have above works but is very slow. I am looking for a faster alternative, or thoughts on how I could optimise this further. – TheDogfather17 Jun 13 '19 at 09:41
  • Faster than endless recursion? Almost anything will do. – Stop harming Monica Jun 13 '19 at 09:44
  • You don't need to do this recursively. Do you know if your code is spending most of its time doing file I/O or the fuzzy compares? If it's I/O bound then you might be able speed it up by using multithreading. – martineau Jun 13 '19 at 11:27
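The multithreading idea from the comments can be sketched as follows. This is a minimal, self-contained illustration rather than the asker's actual code: it creates two sample files in a temporary directory and reads them concurrently with a `ThreadPoolExecutor`, which only helps if the workload is I/O bound (the GIL is released during file reads).

```python
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor

# Create a couple of sample files so the sketch is self-contained.
tmpdir = tempfile.mkdtemp()
for name, text in [('x.txt', b'hello'), ('y.txt', b'world')]:
    with open(os.path.join(tmpdir, name), 'wb') as f:
        f.write(text)

def read_file(name):
    # Each read runs in a worker thread; threads overlap the I/O waits.
    with open(os.path.join(tmpdir, name), 'rb') as f:
        return name, f.read()

# Map file names to their contents, reading up to 4 files at a time.
with ThreadPoolExecutor(max_workers=4) as pool:
    contents = dict(pool.map(read_file, os.listdir(tmpdir)))

print(contents)
```

With the contents cached in a dict like this, the comparison loop no longer has to reopen the same files for every pair, which is a second, independent saving.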

1 Answer


There is no need for the recursive call Find_Similar_Text(text_files, count), at least. Store the result of itertools.combinations(text_files, 2) in a variable, update it as files are removed, and iterate over it with a plain for loop.
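A minimal sketch of that idea: materialise the pairs once, then skip any pair involving a file that has already been moved aside, instead of recursing after every removal. Here similarity() is a hypothetical stand-in for fuzz.ratio on the file contents, used only to keep the example self-contained; the real code would read the two files and call fuzz.ratio as in the question.

```python
import itertools

def similarity(a, b):
    # Hypothetical stand-in for fuzz.ratio on file contents: treats files
    # whose names share a first letter as near-duplicates.
    return 100 if a[0] == b[0] else 0

files = ['a1.txt', 'a2.txt', 'b1.txt']
removed = set()

# Build the pair list once instead of restarting the whole scan.
pairs = list(itertools.combinations(files, 2))
for a, b in pairs:
    if a in removed or b in removed:
        continue  # one of the pair was already moved aside
    if similarity(a, b) > 70:
        removed.add(a)  # real code: shutil.move(...) and pop from the dict

print(sorted(removed))
```

This visits each pair at most once, so the overall cost stays at a single pass over the combinations rather than one pass per removed file.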

Jainil Patel