Input from stream:
tag tag_id
tag0000001 12312
tag0000002 12
tag0000003 3
tag0000004 8
tag0000005 12312
tag0000006 12312
... ...
Preferred output (can be a single file or multiple files):
12312 tag0000001 tag0000005 tag0000006 ...
12 tag0000002 ...
3 tag0000003 ...
8 tag0000004 ...
... ...
Additional info:
- Input is piped in from another program (read from a binary file - can be read multiple times)
- The input has hundreds of millions of rows
- Number of input rows is known
- All output files / distinct tag_ids are known in advance (tens of thousands of them)
- The number of tags in an output file vary greatly (from a couple to millions)
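For clarity, the end result is just a group-by on tag_id. If everything fit in memory, a toy version would look roughly like this (obviously not usable for hundreds of millions of rows, but it shows the output I'm after):

import fileinput
from collections import defaultdict

# Toy sketch of the grouping, assuming all rows fit in memory (here they don't).
groups = defaultdict(list)  # tag_id -> list of tags
for line in fileinput.input():
    tag, tag_id = line.rstrip().split("\t")
    groups[tag_id].append(tag)
for tag_id, tags in groups.items():
    with open(f"{tag_id}.tsv", "w") as f:
        f.write(tag_id + "\t" + "\t".join(tags) + "\n")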
What I've tried so far:
from queue import Queue
from threading import Thread
import fileinput
import fcntl

num_threads = 10
total_in_memory = 1000000  # max number of tags buffered in memory at once
single_in_memory_count = total_in_memory // num_threads

def process_tag_queue(q):
    # Each thread takes lines from the queue, groups them into a dict and
    # flushes the dict to files once single_in_memory_count lines are buffered.
    current_in_mem_count = 0
    tag_dict = {}
    while True:
        line = q.get()
        if line is None:  # sentinel from the main thread: no more input
            q.task_done()
            break
        tag, tag_id = line.rstrip().split("\t")
        if tag_id in tag_dict:
            tag_dict[tag_id].append(tag)
        else:
            tag_dict[tag_id] = [tag]
        current_in_mem_count += 1
        if current_in_mem_count == single_in_memory_count:
            write_tag_dict_to_file(tag_dict)
            current_in_mem_count = 0
            tag_dict = {}
        q.task_done()
    write_tag_dict_to_file(tag_dict)  # flush whatever is left

def write_tag_dict_to_file(tag_dict):
    # Append each tag_id's buffered tags to its own file, guarded by an
    # advisory lock so concurrent writers don't interleave their output.
    for tag_id, tags in tag_dict.items():
        with open(f"{tag_id}.tsv", "a") as f:
            fcntl.flock(f.fileno(), fcntl.LOCK_EX)
            f.write("\t" + "\t".join(tags))
            fcntl.flock(f.fileno(), fcntl.LOCK_UN)

def run_single():
    # Single-threaded: buffer tags in a dict and flush it to files every time
    # single_in_memory_count lines have been read.
    tag_dict = {}
    current_in_mem_count = 0
    for line in fileinput.input():
        tag, tag_id = line.rstrip().split("\t")
        if tag_id in tag_dict:
            tag_dict[tag_id].append(tag)
        else:
            tag_dict[tag_id] = [tag]
        current_in_mem_count += 1
        if current_in_mem_count == single_in_memory_count:
            write_tag_dict_to_file(tag_dict)
            current_in_mem_count = 0
            tag_dict = {}
    write_tag_dict_to_file(tag_dict)  # flush whatever is left

def run_multi():
    # Main thread reads the input and feeds lines to the worker threads.
    q = Queue(maxsize=5000)  # bounded so the queue can't grow without limit
    threads = []
    for _ in range(num_threads):
        thread = Thread(target=process_tag_queue, args=(q,), daemon=True)
        thread.start()
        threads.append(thread)
    for line in fileinput.input():
        q.put(line)
    for _ in range(num_threads):
        q.put(None)  # one sentinel per worker so every thread exits and flushes
    q.join()
    for thread in threads:
        thread.join()

if __name__ == '__main__':
    run_single()
    #run_multi()

While both approaches work, I was somewhat disappointed by the performance of the multithreaded one: the single-threaded version was much faster. Python multiprocessing was even worse, because every element put on the queue has to be pickled, which is slow.
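One variant I haven't benchmarked: batching lines into chunks before putting them on the multiprocessing queue, so each put() pickles one large list instead of one short string per row. A rough sketch (chunk_size, the process count and the worker body are placeholders):

import fileinput
import multiprocessing as mp

chunk_size = 50000  # guess: big enough to amortize the pickling cost per put()

def worker(q):
    while True:
        chunk = q.get()
        if chunk is None:  # sentinel: no more input
            break
        for line in chunk:
            tag, tag_id = line.rstrip().split("\t")
            # ... buffer into a dict and flush to files as in process_tag_queue ...

def run_multiproc(num_procs=4):
    q = mp.Queue(maxsize=20)  # bounded so the reader can't run too far ahead
    procs = [mp.Process(target=worker, args=(q,)) for _ in range(num_procs)]
    for p in procs:
        p.start()
    chunk = []
    for line in fileinput.input():
        chunk.append(line)
        if len(chunk) == chunk_size:
            q.put(chunk)
            chunk = []
    if chunk:
        q.put(chunk)  # last partial chunk
    for _ in procs:
        q.put(None)
    for p in procs:
        p.join()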
Answers to similar questions recommend keeping the output files open, but that doesn't seem reasonable with tens of thousands of output files. Given the size of the input, I'd also rather avoid converting it to a text file and sorting it.
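The only compromise on the "keep the files open" advice I can think of is a bounded LRU cache of handles, so the hottest tag_ids stay open and cold ones get reopened on demand. A sketch I haven't measured (OPEN_LIMIT is arbitrary, just kept below the OS file-descriptor limit):

from collections import OrderedDict

OPEN_LIMIT = 500  # arbitrary; must stay below the process fd limit (ulimit -n)

class HandleCache:
    # Keeps at most OPEN_LIMIT files open, evicting the least recently used one.
    def __init__(self, limit=OPEN_LIMIT):
        self.limit = limit
        self.handles = OrderedDict()  # tag_id -> open file, oldest first

    def get(self, tag_id):
        f = self.handles.get(tag_id)
        if f is not None:
            self.handles.move_to_end(tag_id)  # mark as most recently used
            return f
        if len(self.handles) >= self.limit:
            _, oldest = self.handles.popitem(last=False)  # evict LRU handle
            oldest.close()
        f = open(f"{tag_id}.tsv", "a")
        self.handles[tag_id] = f
        return f

    def close_all(self):
        for f in self.handles.values():
            f.close()
        self.handles.clear()

write_tag_dict_to_file could then ask the cache for a handle instead of reopening and closing a file on every flush.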
Could multithreading/multiprocessing even give a performance boost in this context?
Any suggestions or tips are welcome.