I have around 10,000 text files, and quite a lot of them have very similar content. I am trying to get rid of the files that are very similar to each other so that I am left with a smaller, more unique set. For reference, the contents of a file can be several pages long.
I am trying to solve this by measuring the string distance between the files' contents, specifically their Levenshtein distance. To get some quick gains I have tried a few ways to reduce the number of comparisons, such as only comparing files of similar size and similar text length.
import itertools
import os
import shutil
import time

from fuzzywuzzy import fuzz  # or: from rapidfuzz import fuzz

# text_directory is the folder holding the ~10,000 text files (defined earlier in my script)
text_files = {}
for item in os.listdir(text_directory):
    text_files[item] = os.path.getsize(os.path.join(text_directory, item))

count = 0

def Find_Similar_Text(text_files, count):
    tic = time.process_time()
    for a, b in itertools.combinations(list(text_files), 2):
        # skip pairs whose first file has already been moved away
        if a not in text_files or b not in text_files:
            continue
        # quick filter: only compare files whose sizes are within 50 bytes of each other
        if text_files[a] - 50 < text_files[b] < text_files[a] + 50:
            with open(os.path.join(text_directory, a), 'rb') as file1:
                file1_data = file1.read()
            with open(os.path.join(text_directory, b), 'rb') as file2:
                file2_data = file2.read()
            # second filter: contents must be within 100 bytes of each other in length
            if -100 < len(file1_data) - len(file2_data) < 100:
                ratio = fuzz.ratio(file1_data, file2_data)
                if ratio > 70:
                    count += 1
                    print(count, 'Ratio:', ratio, a, text_files[a], 'bytes', b, text_files[b], 'bytes')
                    shutil.move(os.path.join(text_directory, a), os.path.join(text_directory, 'SimilarFiles'))
                    text_files.pop(a)
    toc = time.process_time()
    print('Elapsed time:', toc - tic)
    Find_Similar_Text(text_files, count)  # restart over the remaining files

Find_Similar_Text(text_files, count)
I know that this will currently run into an endless loop at the end of the process because of the unconditional recursive call, but I'm still quite far from crossing that bridge.
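When I do get to that bridge, my current thinking is to replace the recursion with a plain loop that repeats passes until a pass moves no more files. Below is a minimal sketch of that idea, not something I have settled on: one_pass is just a name I made up here, it reuses the same size, length, and ratio thresholds as above, and it assumes the SimilarFiles folder already exists and that text_files and text_directory are set up as in the code above.

import itertools
import os
import shutil

from fuzzywuzzy import fuzz  # or: from rapidfuzz import fuzz

def one_pass(text_files, text_directory):
    """Run one round of pairwise comparisons; return how many files were moved."""
    moved = 0
    for a, b in itertools.combinations(list(text_files), 2):
        # a may already have been moved earlier in this pass
        if a not in text_files or b not in text_files:
            continue
        # same quick size filter as above (sizes within 50 bytes of each other)
        if abs(text_files[a] - text_files[b]) >= 50:
            continue
        with open(os.path.join(text_directory, a), 'rb') as f:
            data_a = f.read()
        with open(os.path.join(text_directory, b), 'rb') as f:
            data_b = f.read()
        # same length filter and similarity threshold as above
        if abs(len(data_a) - len(data_b)) < 100 and fuzz.ratio(data_a, data_b) > 70:
            # assumes the SimilarFiles folder already exists inside text_directory
            shutil.move(os.path.join(text_directory, a), os.path.join(text_directory, 'SimilarFiles'))
            text_files.pop(a)
            moved += 1
    return moved

# keep making passes until a whole pass moves nothing, instead of recursing forever
while one_pass(text_files, text_directory) > 0:
    pass

The loop stops as soon as a full pass moves nothing, so it cannot spin forever the way the unconditional recursive call does.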