If you can't load the file into memory, why not split it smartly into smaller files and work on those? You only need to guarantee that identical lines end up in the same file, and you want some collisions so you don't end up with a huge number of files.
Here is a script that takes the prefix of each line (the prefix size can be changed, obviously) and puts the line in the file corresponding to that prefix.
This is actually much like a hash map, only not in memory, since your RAM cannot handle the amount of data you're trying to process.
The result is many smaller files (buckets, if you will) where all occurrences of a line are grouped in the same file (same prefix). They can be de-duplicated individually and then merged into the result file.
Here is how it's done:
Initialize the program to read from the file input.txt, write to output.txt, and use a prefix size of 2 to hash/split:
import os
input_file_name = 'input.txt'
split_folder = 'splits'
prefix_size = 2
Create the folder that will hold the split files containing similar and identical lines:
# create hash files folder
if not os.path.exists(split_folder):
    os.makedirs(split_folder)
Line-distributing function - puts a line in a specified file:
# a function to put a line in a file
def put_in_file(file_name, line):
    with open(os.path.join(split_folder, file_name), 'a') as f:
        f.write(line)
Hash function that guarantees some collisions (which is good) and that identical lines end up in the same file:
def prefix_hash(line):
    return line[:prefix_size]
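For example, with a prefix size of 2, the first two (hypothetical) lines below land in the bucket file named he while the third goes to a different bucket; identical lines always map to the same bucket:

prefix_hash('hello world\n')  # -> 'he'
prefix_hash('help me\n')      # -> 'he'  (a collision: different lines, same bucket)
prefix_hash('apple pie\n')    # -> 'ap'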
Now we distribute the lines to their smaller files (like hash "buckets"):
with open(input_file_name) as f:
    # convenience method: hash a line and write it to its bucket file
    def putter(line):
        put_in_file(prefix_hash(line), line)

    # make sure every line ends with a newline so the buckets stay line-aligned
    for line in f:
        putter(
            line + ('\n' if not line.endswith('\n') else '')
        )
Generate a list of created file names:
# build a list (not a lazy map) so the file names can be iterated more than once
split_file_names = [
    os.path.join(split_folder, x) for x in os.listdir(split_folder)
]
De-duplicate the lines in the smaller files (each bucket should now be small enough to hold in memory):
for split_file_name in split_file_names:
    # dedup each file
    with open(split_file_name, 'r') as f:
        unique_lines = set(f.readlines())
    with open(split_file_name, 'w') as f:
        f.write(''.join(unique_lines))
Join smaller files into the result file:
output_file = "output.txt"
with open(output_file, 'w') as of:
    for split_file_name in split_file_names:
        with open(split_file_name, 'r') as f:
            of.write(f.read())
The whole thing together:
import os

input_file_name = 'input.txt'
split_folder = 'splits'
prefix_size = 2

# create hash files folder
if not os.path.exists(split_folder):
    os.makedirs(split_folder)

# a function to put a line in a file
def put_in_file(file_name, line):
    with open(os.path.join(split_folder, file_name), 'a') as f:
        f.write(line)

def prefix_hash(line):
    return line[:prefix_size]

with open(input_file_name) as f:
    # convenience method: hash a line and write it to its bucket file
    def putter(line):
        put_in_file(prefix_hash(line), line)

    # make sure every line ends with a newline so the buckets stay line-aligned
    for line in f:
        putter(
            line + ('\n' if not line.endswith('\n') else '')
        )

# build a list (not a lazy map) so the file names can be iterated more than once
split_file_names = [
    os.path.join(split_folder, x) for x in os.listdir(split_folder)
]

for split_file_name in split_file_names:
    # dedup each file
    with open(split_file_name, 'r') as f:
        unique_lines = set(f.readlines())
    with open(split_file_name, 'w') as f:
        f.write(''.join(unique_lines))

output_file = "output.txt"
with open(output_file, 'w') as of:
    for split_file_name in split_file_names:
        with open(split_file_name, 'r') as f:
            of.write(f.read())
Note: To make this a lot faster, you should keep the file handles open at all times and probably spin up a few worker threads with a queue to pass lines among them (this avoids waiting on I/O as well as repeatedly opening and closing the files). I can add this later if anyone wants it.
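As a rough illustration of the first suggestion only (keeping the file handles open), here is a minimal sketch of the distribution step; the open_files cache and the get_handle helper are hypothetical names, not part of the script above, and the threaded/queue version is left out:

import os

input_file_name = 'input.txt'
split_folder = 'splits'
prefix_size = 2

if not os.path.exists(split_folder):
    os.makedirs(split_folder)

# hypothetical cache of open bucket file handles, keyed by prefix
open_files = {}

def get_handle(prefix):
    # open each bucket file once and reuse the handle for every later write
    if prefix not in open_files:
        open_files[prefix] = open(os.path.join(split_folder, prefix), 'a')
    return open_files[prefix]

with open(input_file_name) as f:
    for line in f:
        if not line.endswith('\n'):
            line += '\n'
        get_handle(line[:prefix_size]).write(line)

# close every bucket file once the whole input has been distributed
for handle in open_files.values():
    handle.close()

The dedup and merge steps from above stay exactly the same.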