
I have a use-case where I need to concatenate a large number of CSV files into one, while maintaining the order of the rows.

For example:

> cat file1.csv

1,bla,bla
2,bla,bla

> cat file2.csv

2,bla,bla
2,bla,bla
3,bla,bla

> cat desired_output.txt

1,bla,bla
2,bla,bla
2,bla,bla
2,bla,bla
3,bla,bla

Currently, I'm doing this serially, reading each file in sequence and appending it to a single concatenated output file (reading and writing in binary mode for speed).
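
For reference, a minimal sketch of what the serial version looks like (the file list and chunk size here are placeholders, not my real code):

    import shutil

    csv_files = ["file1.csv", "file2.csv"]  # placeholder input list

    with open("concat.csv", "wb") as out:
        for path in csv_files:
            with open(path, "rb") as src:
                # copy raw bytes in 1 MiB chunks; no parsing, row order preserved
                shutil.copyfileobj(src, out, length=1024 * 1024)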

Since the machine I'm using has multiple cores available, I was wondering whether there is an easy way in base Python (joblib/pandas is also OK) to build some sort of aggregation tree, so that partial files are merged in parallel and the output is again a single CSV with the rows in order. A rough sketch of what I mean is below.
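
Roughly, I'm picturing something like this untested sketch built only on the standard library; merge_pair, tree_concat, and the file names are made-up placeholders, and I don't know whether it would actually beat the serial version since the work is mostly I/O:

    import os
    import shutil
    import tempfile
    from concurrent.futures import ProcessPoolExecutor

    def merge_pair(paths):
        """Concatenate one or two files into a new temp file, keeping row order."""
        fd, out_path = tempfile.mkstemp(suffix=".csv")
        with os.fdopen(fd, "wb") as out:
            for p in paths:
                with open(p, "rb") as src:
                    shutil.copyfileobj(src, out)
        return out_path

    def tree_concat(paths, final_path):
        """Merge adjacent pairs level by level until one file remains."""
        level = list(paths)
        with ProcessPoolExecutor() as pool:
            while len(level) > 1:
                pairs = [level[i:i + 2] for i in range(0, len(level), 2)]
                # pool.map returns results in input order,
                # so the overall row order is preserved
                level = list(pool.map(merge_pair, pairs))
        shutil.move(level[0], final_path)

    if __name__ == "__main__":
        tree_concat(["file1.csv", "file2.csv"], "concat.csv")

(Intermediate temp files are not cleaned up in this sketch, and with only two inputs the "tree" degenerates to a single merge; it is only meant to show the shape of the idea.)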

  • I highly doubt doing this operation in parallel will be faster (at least clearly not in Python). I also expect the operation to be I/O bound. Writing in parallel to a (shared) storage device may not be faster; it is often slower, especially on an HDD. SSDs are better suited for that, but the OS sometimes uses internal locks that prevent the operation from scaling. Old SATA SSDs should already be saturated by sequential writes. – Jérôme Richard Jan 07 '22 at 15:30
  • With all respect, the problem is not related to [PARALLEL]-processing; it is purely resources-limited, just-[CONCURRENT] processing (having no parallelism-related coordination at all). Using GIL-(re)[SERIAL]-ized processes with HUGE spawn + SER/DES add-on costs hurts this use-case. The use-case requirement of "maintaining order" does not, so far, provide any details on (proper) re-ordering of N-many lines that start with the same number M ("2,..."). If their order is irrelevant (once coming from any number and order of CSV-file inputs), then ordering stops mattering at all, doesn't it? – user3666197 Jan 07 '22 at 15:54

0 Answers