I have about 50k space-separated, csv-like files, each with tens of millions of rows. The first column is always a string without any whitespace, the second is always a positive integer, and there is no missing data. In this problem I am ONLY interested in the first column, so please ignore the second one. Here are toy examples of two such files.
example1.csv
f1 x
f2 x
f5 x
f7 x
...
example2.csv
f1 x
f2 x
f3 x
f4 x
f6 x
...
As you can see, the feature sets of the two files overlap but are not identical. What I want to do is combine the data from all 50k csv files and transform it into the following form.
file_name f1 f2 f3 f4 f5 f6 f7 ....
example1.csv 1 1 0 0 1 0 1 ...
example2.csv 1 1 1 1 0 1 0 ...
...
So essentially I want to construct a matrix of file_name x feature_id, where an entry is 1 if that feature_id exists in the file and 0 otherwise. The parsing itself is simple; the focus is on scalability, and the number of rows may grow to billions in future projects. I have access to machines with up to one or two terabytes of memory and 100 cores, so I guess memory is less of a concern. My naive implementation further below works well on toy examples, but it becomes too slow on real ones and appears to hang when it reaches about 310,000 lines of the first file, and I am not sure why. (Do you know why? My intuition says it may have something to do with defaultdict; I am not sure how it is implemented, and it might be expensive to use.) I would like the solution to be reasonably fast. Python is preferred, but other languages are fine too.
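To make the target format concrete, here is a minimal sketch of the transformation applied to the two toy files above, using per-file feature sets (the sets and the output file name are hard-coded just for illustration):

import pandas as pd

# per-file feature sets for the two toy files above (hard-coded only for illustration)
toy = {
    'example1.csv': {'f1', 'f2', 'f5', 'f7'},
    'example2.csv': {'f1', 'f2', 'f3', 'f4', 'f6'},
}

all_features = sorted(set().union(*toy.values()))   # f1 .. f7
rows = [[1 if f in feats else 0 for f in all_features] for feats in toy.values()]
df = pd.DataFrame(rows, index=list(toy), columns=all_features)
df.index.name = 'file_name'
df.to_csv('toy_output.csv')   # matches the target table above

And here is my current naive implementation: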
import os
import gzip
from collections import defaultdict
import pandas as pd
# collect paths to all csv-like files
with open('file_list.txt') as inf:
    inputs = [line.strip() for line in inf if line.strip()]
inputs_map = dict(zip(inputs, range(len(inputs))))

# one list of 0/1 flags per feature_id, with one slot per input file
res = defaultdict(lambda: [0] * len(inputs))
for k, infile in enumerate(inputs):
    print(k, infile)
    source_file = os.path.abspath(infile)
    source_file_id = inputs_map[infile]
    # start parsing the csv-like file; only the first column is used
    with gzip.open(source_file, 'rt') as inf:
        for kl, line in enumerate(inf):
            feature_id = line.split()[0]
            res[feature_id][source_file_id] = 1
            if (kl + 1) % 10000 == 0:
                print('File {0}'.format(k), 'Line {0}'.format(kl + 1), source_file)

# rows are files, columns are feature_ids, i.e. the target layout above
df = pd.DataFrame(res)
df.index = inputs
print('starting writing to disk...')
df.to_csv('output.csv')
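To make my defaultdict suspicion a bit more concrete: the default factory in res = defaultdict(lambda: [0] * len(inputs)) builds a brand-new list with one slot per input file every time an unseen feature_id is touched. A tiny illustration of that per-key cost (the 50k figure and the feature name are made up for the example; I have not profiled this, it is just my guess):

from collections import defaultdict
import sys

n_files = 50000                                 # roughly the number of input files
res = defaultdict(lambda: [0] * n_files)

res['some_new_feature'][0] = 1                  # first access of an unseen key runs the factory
print(sys.getsizeof(res['some_new_feature']))   # roughly 400 KB per new key on 64-bit CPython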