I have the following code for detecting duplicates in a file and outputting them into 3 separate files: one for non-duplicated lines, one for duplicates occurring exactly twice (x2), and one for duplicates occurring more than twice (> x2). The first file holds only lines that had no duplicates in the original file (it doesn't keep a single copy of each duplicated line; it only keeps lines that were already unique).
import sys
import collections
file_in = sys.argv[1]
file_ot = file_in + ".proc"    # unique lines
file_ot2 = file_in + ".proc2"  # duplicates (exactly x2)
file_ot3 = file_in + ".proc3"  # duplicates (> x2)
counter = 0
dict_in = collections.defaultdict(list)  # key -> all lines sharing that key
with open(file_in, "r") as f:
    for line in f:
        # print("read line: " + str(line))
        counter += 1
        fixed_line = line.strip()          # drop the trailing newline
        line_list = fixed_line.split(";")  # fields are ';'-separated
        key = line_list[0][:12]            # first 12 chars of the first field
        print(":Key: " + str(key))
        dict_in[key].append(line)          # keep the original line, grouped by key
with open(file_ot, "w") as f1, open(file_ot2, "w") as f2, open(file_ot3, "w") as f3:
    for values in dict_in.values():
        if len(values) == 1:
            f1.writelines(values)  # key occurred once
        elif len(values) == 2:
            f2.writelines(values)  # key occurred exactly twice
        else:
            f3.writelines(values)  # key occurred more than twice
print("Read: " + str(counter) + " lines")
The above code works, but for very large files (~1 GB) it takes about ten minutes to chomp through them on my system. I was wondering if there is a way to speed this code up, or any suggestions in that direction (one possible direction is sketched after the expected output below). Thank you in advance!
Input data example:
0000AAAAAAAA;X;;X;
0000AAAAAAAA;X;X;;
0000BBBBBBBB;X;;;
0000CCCCCCCC;;X;;
0000DDDDDDDD;X;;X;
0000DDDDDDDD;X;X;;
0000DDDDDDDD;X;X;X;X
0000EEEEEEEE;X;X;X;X
0000FFFFFFFF;X;;;
0000GGGGGGGG;X;;X;
0000HHHHHHHH;X;X;;
0000JJJJJJJJ;X;X;;
Expected output:
FILE1:
0000BBBBBBBB;X;;;
0000CCCCCCCC;;X;;
0000EEEEEEEE;X;X;X;X
0000FFFFFFFF;X;;;
0000GGGGGGGG;X;;X;
0000HHHHHHHH;X;X;;
0000JJJJJJJJ;X;X;;
FILE2:
0000AAAAAAAA;X;;X;
0000AAAAAAAA;X;X;;
FILE3:
0000DDDDDDDD;X;;X;
0000DDDDDDDD;X;X;;
0000DDDDDDDD;X;X;X;X
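For reference, here is a minimal sketch of one direction I have been wondering about (untested against the real data; it assumes the key is always the first 12 bytes of each line, which holds for the sample above). It drops the per-line print (console output alone can dominate at millions of lines), reads and writes in binary mode so nothing is decoded, and makes two passes so only a Counter of keys is held in memory rather than the whole file contents:

import sys
import collections

file_in = sys.argv[1]

# Pass 1: count how many lines share each key.
# Assumption: the key is always the first 12 bytes of the line
# (true for the sample data, where the first field is 12 chars wide).
counts = collections.Counter()
with open(file_in, "rb") as f:
    for line in f:
        counts[line[:12]] += 1

# Pass 2: route every line straight to the right output file,
# so the lines themselves are never held in memory.
with open(file_in, "rb") as f, \
        open(file_in + ".proc", "wb") as f1, \
        open(file_in + ".proc2", "wb") as f2, \
        open(file_in + ".proc3", "wb") as f3:
    for line in f:
        n = counts[line[:12]]
        if n == 1:
            f1.write(line)
        elif n == 2:
            f2.write(line)
        else:
            f3.write(line)

print("Read: " + str(sum(counts.values())) + " lines")

One behavioural difference: lines are written in input order rather than grouped by key, so duplicates only come out adjacent if they are adjacent in the input, as in the example above.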