I have some gigantic unsorted files of IDs like:
file1.txt
a1
a2
a3...etc
file2.txt
b1
a2
c1...etc
And I'm trying to eventually put them into a single, sorted file. They are multiple gigabytes each, so I can't load them all into memory.
My current solution is to iterate through each file and write lines out to new files based on the first character of each ID. This creates a directory of up to 26 files, one per letter of the alphabet. Then I can combine the files later on, since each letter's file can be loaded into memory. This assumes the IDs' first characters are roughly evenly distributed across the alphabet:
import os

outputs = {}
for filename in os.listdir(directory):
    with open(os.path.join(directory, filename)) as f:
        for line in f:
            first = line[0]
            # lazily open one bucket file per leading character
            if first not in outputs:
                outputs[first] = open('sorted_' + first + '.txt', 'w')
            outputs[first].write(line)

# close every bucket file once all input has been distributed
for handle in outputs.values():
    handle.close()
(then sort individually and concat the newly categorized files)
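For completeness, the follow-up step I have in mind is roughly this sketch (the file names sorted_<letter>.txt and final_sorted.txt are just the placeholders I'm using, and it assumes each per-letter bucket fits in memory):

import os
import string

# Sketch: sort each per-letter bucket in memory, then append it to one combined file.
with open('final_sorted.txt', 'w') as out:
    for letter in string.ascii_lowercase:
        bucket = 'sorted_' + letter + '.txt'
        if not os.path.exists(bucket):
            continue                      # no IDs started with this letter
        with open(bucket) as f:
            lines = f.readlines()         # assumes each bucket fits in memory
        lines.sort()
        out.writelines(lines)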
My question here is: how much of the content of these (up to) 26 new files is being kept in memory? Is each line written to disk immediately, or only after the file is closed?
I notice that usually if I cat the file being created in another terminal window, it doesn't actually contain the content I'm writing to it until .close() is called. But the pending data could be kept in a temporary file somewhere, I'm not sure.
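Here's the small probe I've been using to poke at this (the file name buffer_probe.txt is made up); it writes one short line and compares the on-disk size before and after an explicit flush():

import os

# Write a short line, then check the on-disk size before and after flush()
# to see when the data actually leaves Python's write buffer.
probe = open('buffer_probe.txt', 'w')
probe.write('a1\n')
print('before flush:', os.path.getsize('buffer_probe.txt'))  # typically still 0
probe.flush()
print('after flush: ', os.path.getsize('buffer_probe.txt'))  # now 3
probe.close()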
Is this just keeping everything in memory and thus hugely inefficient?