
Input from stream:

tag tag_id

tag0000001  12312
tag0000002  12
tag0000003  3
tag0000004  8
tag0000005  12312
tag0000006  12312
... ...

Preferred output (can be in a single or multiple files):

12312   tag0000001  tag0000005  tag0000006  ...
12  tag0000002  ...
3   tag0000003  ...
8   tag0000004  ...
... ...

Additional info:

  • Input is piped in from another program (it reads a binary file, so the input can be regenerated and read multiple times)
  • The input has hundreds of millions of rows
  • Number of input rows is known
  • All output files/different tag_ids are known (tens of thousands)
  • The number of tags per output file varies greatly (from a couple to millions)

What I've tried so far:

from queue import Queue, Empty
from threading import Thread
import fileinput
import fcntl
import os

num_threads = 10
total_in_memory = 1000000
single_in_memory_count = int(total_in_memory / num_threads)

def process_tag_queue(q):
    # Each thread gets a line from the queue, saves it into a dict and flushes
    # the dict into a file when single_in_memory_count is reached
    current_in_mem_count = 0
    tag_dict = {}
    while True:
        try:
            # A blocking get() never raises Empty, so use a timeout to notice
            # that the producer has stopped feeding the queue.
            tag, tag_id = q.get(timeout=5).rstrip().split("\t")
        except Empty:
            break

        if tag_id in tag_dict:
            tag_dict[tag_id].append(tag)
        else:
            tag_dict[tag_id] = [tag]
        current_in_mem_count +=1
        if current_in_mem_count == single_in_memory_count:
            write_tag_dict_to_file(tag_dict)
            current_in_mem_count = 0
            tag_dict = {}
        q.task_done()
    write_tag_dict_to_file(tag_dict) # Add remaining

def write_tag_dict_to_file(tag_dict):
    for tag_id, tags in tag_dict.items():
        out_file = f"{tag_id}.tsv"

        with open(out_file, "a") as f:
            # Exclusive lock so concurrent workers can't interleave their appends
            fcntl.flock(f.fileno(), fcntl.LOCK_EX)
            f.write("\t" + "\t".join(tags))
            fcntl.flock(f.fileno(), fcntl.LOCK_UN)

def run_single(): # Save tags into a dict and flush it into files when single_in_memory_count is reached
    tag_dict = {}
    current_in_mem_count = 0
    for line in fileinput.input():
        tag, tag_id = line.rstrip().split("\t")

        if tag_id in tag_dict:
            tag_dict[tag_id].append(tag)
        else:
            tag_dict[tag_id] = [tag]
        current_in_mem_count +=1
        if current_in_mem_count == single_in_memory_count:
            write_tag_dict_to_file(tag_dict)
            current_in_mem_count = 0
            tag_dict = {}
    write_tag_dict_to_file(tag_dict) 

def run_multi(): # Main thread puts lines into the queue
    q = Queue(maxsize=5000)  # Don't let the queue get too long
    threads = []
    for i in range(num_threads):
        thread = Thread(target=process_tag_queue, args=(q,))
        thread.daemon = True  # setDaemon() is deprecated
        thread.start()
        threads.append(thread)
    for line in fileinput.input():
        q.put(line)
    q.join()
    # Wait for the workers to hit the get() timeout and flush their remainders;
    # otherwise the daemon threads are killed before the final write.
    for thread in threads:
        thread.join()

if __name__ == '__main__':
    run_single()
    #run_multi()

While both approaches work, I was somewhat disappointed by the performance of the multithreaded one: the single-threaded version was much faster. Using Python multiprocessing was even worse, since every element put on the queue has to be pickled, which is slow.
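For what it's worth, a batched variant of the queue approach would look roughly like the sketch below (untested; the chunk size and worker count are guesses, run_batched and worker are made-up names, and it reuses write_tag_dict_to_file from above). The point is that each queue item is a list of lines, so pickling happens once per chunk instead of once per line:

import fileinput
import multiprocessing as mp

CHUNK_SIZE = 100000   # lines per queue item (needs tuning)
NUM_WORKERS = 4       # needs tuning

def worker(q):
    while True:
        chunk = q.get()
        if chunk is None:  # sentinel: no more input
            break
        tag_dict = {}
        for line in chunk:
            tag, tag_id = line.rstrip().split("\t")
            tag_dict.setdefault(tag_id, []).append(tag)
        write_tag_dict_to_file(tag_dict)  # same helper as above (flock matters here)

def run_batched():
    q = mp.Queue(maxsize=10)  # at most 10 chunks in flight, bounds memory use
    workers = [mp.Process(target=worker, args=(q,)) for _ in range(NUM_WORKERS)]
    for p in workers:
        p.start()
    chunk = []
    for line in fileinput.input():
        chunk.append(line)
        if len(chunk) == CHUNK_SIZE:
            q.put(chunk)
            chunk = []
    if chunk:
        q.put(chunk)
    for _ in workers:
        q.put(None)  # one sentinel per worker
    for p in workers:
        p.join()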

Answers to similar questions recommend keeping the output files open, but this doesn't seem reasonable with thousands of output files. Due to the size of the input file, I'd rather avoid converting it into a text file and sorting it.
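For concreteness, the "keep the output files open" suggestion with a hard cap could look something like the bounded handle cache below. This is an untested sketch: HandleCache is a made-up name and MAX_OPEN = 100 just mirrors the handle count mentioned in the comments. write_tag_dict_to_file could fetch handles from such a cache instead of reopening a file on every flush, as long as only one process writes each file (otherwise the flock calls are still needed):

from collections import OrderedDict

MAX_OPEN = 100  # cap on simultaneously open files

class HandleCache:
    """Keep at most max_open files open in append mode; close the least
    recently used handle when the cap is reached."""

    def __init__(self, max_open=MAX_OPEN):
        self.max_open = max_open
        self.handles = OrderedDict()  # tag_id -> open file object

    def get(self, tag_id):
        if tag_id in self.handles:
            self.handles.move_to_end(tag_id)  # mark as recently used
            return self.handles[tag_id]
        if len(self.handles) >= self.max_open:
            _, oldest = self.handles.popitem(last=False)
            oldest.close()  # evict the least recently used handle
        f = open(f"{tag_id}.tsv", "a")
        self.handles[tag_id] = f
        return f

    def close_all(self):
        for f in self.handles.values():
            f.close()
        self.handles.clear()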

Could multithreading/processing even give a performance boost in this context?

Any suggestions or tips are welcome.

Michael
  • Does this answer your question? [Python, multithreading too slow, multiprocess](https://stackoverflow.com/questions/8774989/python-multithreading-too-slow-multiprocess) – Jérôme Richard Apr 19 '21 at 18:00
  • @JérômeRichard Unfortunately, no. As mentioned, sending data over to a process via queue is terribly slow. But this gave me an idea to try going through the large file with a couple of processes each given their own ids to look out for and write. More reading but at least no GIL. – Michael Apr 20 '21 at 10:26
  • Some thoughts: Threading won't help if it's IO-bound. `total_in_memory` seems quite low. Just an idea: a shared `collections.defaultdict` using `multiprocess.manager`. Each thread gets its own chunk of the file. `Defaultdict` is very efficient and auto-materializes values when a key is missing, so there's no need to check whether the key exists or handle KeyError exceptions. Defaultdict is also thread-safe. – IODEV Apr 20 '21 at 18:16
  • @IODEV Did I get this right: main fills the defaultdict while subprocesses are constantly checking if the dict is "full"? I've currently pumped up the in_memory count and I'm keeping as many files open as I dare (100). – Michael Apr 20 '21 at 19:27
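For reference, a rough sketch of the partitioning idea from the comment above: each process starts the producer itself (the input can be regenerated, per the "Additional info" above) and keeps only the tag_ids that fall into its own slice, so no data is ever sent between processes. Untested; producer_cmd is a placeholder for the real producer command, NUM_PROCS is arbitrary, crc32 is used instead of hash() because Python's string hash is randomized per process, and the periodic flushing from process_tag_queue is omitted for brevity:

import subprocess
import zlib
import multiprocessing as mp

NUM_PROCS = 4                   # arbitrary
producer_cmd = ["./producer"]   # placeholder for the program that generates the stream

def partition_worker(index):
    tag_dict = {}
    proc = subprocess.Popen(producer_cmd, stdout=subprocess.PIPE, text=True)
    for line in proc.stdout:
        tag, tag_id = line.rstrip().split("\t")
        # Deterministic hash so every process agrees on the partitioning
        if zlib.crc32(tag_id.encode()) % NUM_PROCS != index:
            continue  # some other process owns this tag_id
        tag_dict.setdefault(tag_id, []).append(tag)
    proc.wait()
    write_tag_dict_to_file(tag_dict)  # same helper as above

if __name__ == "__main__":
    procs = [mp.Process(target=partition_worker, args=(i,)) for i in range(NUM_PROCS)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()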
