
I have about 50k space-separated, csv-like files, each with tens of millions of rows. The first column is always a string without any spaces, the second is always a positive integer, and there is no missing data. In this problem I am ONLY interested in the first column, so please ignore the second. Here are toy examples of two such files.

example1.csv

f1 x
f2 x
f5 x
f7 x
...

example2.csv

f1 x
f2 x
f3 x
f4 x
f6 x
...

As you can see, the feature sets of the two files overlap but are not identical. What I want to do is combine the data from all 50k files and transform it into the following form.

file_name    f1 f2 f3 f4 f5 f6 f7 ....
example1.csv 1  1  0  0  1  0  1  ...
example2.csv 1  1  1  1  0  1  0  ...
...

So the task is basically to construct a file_name x feature_id matrix: if the feature_id exists in the file, the entry is 1, otherwise 0. The parsing itself is simple; the focus is on scalability, and the number of rows may go up to billions in future projects. I have access to machines with one or two terabytes of memory and 100 cores, so memory is less of a concern. My naive implementation below works on toy examples, but it is too slow for real ones and appears to hang at around 310,000 lines in the first file, and I am not sure why. (Do you know why? My intuition says it may have something to do with defaultdict; I am not sure how it is implemented and it might be expensive to use.) I would like the solution to be reasonably fast, preferably in Python, but other languages are fine too.

import os
import gzip
from collections import defaultdict

import pandas as pd


# collect paths to all csv-like files, one path per line
with open('file_list.txt') as inf:
    inputs = [line.strip() for line in inf if line.strip()]

# map the absolute path of each input file to its row index
inputs_map = {os.path.abspath(p): i for i, p in enumerate(inputs)}

# feature_id -> 0/1 presence vector over all input files
res = defaultdict(lambda: [0] * len(inputs))
for k, infile in enumerate(inputs):
    print(k, infile)
    source_file = os.path.abspath(infile)
    source_file_id = inputs_map[source_file]
    # start parsing the csv-like file
    with gzip.open(source_file, 'rt') as inf:
        for kl, line in enumerate(inf):
            feature_id = line.split()[0]
            res[feature_id][source_file_id] = 1
            if (kl + 1) % 10000 == 0:
                print('File {0}'.format(k), 'Line {0}'.format(kl + 1), source_file)

# rows are file names, columns are feature ids, matching the desired output
df = pd.DataFrame(res)
df.index = inputs
print('starting writing to disk...')
df.to_csv('output.csv')
  • For two files you can get the row names with `cat example1.csv example2.csv | cut -d " " -f 1 | sort -u`. AFAIK `sort` is very fast and efficient and uses the disk if there isn't enough memory. Repeatedly applying this process to pairs of files might be a good way to go. – Alex Hall Nov 11 '16 at 23:49
  • Do you mean get the union of all rows first, and then loop through each file? I still need the information of which file contains which feature though. – zyxue Nov 11 '16 at 23:53
  • Can you use HDFS + Spark? For scalability, they would seem to be the right tools. – Romain Jouin Nov 11 '16 at 23:58
  • If you have all the row names and you sort the input files, it would be easy to write a program to generate the matrix in linear time and constant memory (see the sketch after these comments). – Alex Hall Nov 11 '16 at 23:59
  • @romainjouin, hdfs + spark sounds like too much overhead for now. But I do think that's the way to go in the future. – zyxue Nov 12 '16 at 00:09
  • Well, if your issue is about scaling, I think they would be worth a try ;) And there is this nice page on your issue: http://stackoverflow.com/questions/24029873/how-to-read-multiple-text-files-into-a-single-rdd – Romain Jouin Nov 12 '16 at 00:13
  • Thanks. I understand what you mean, and I agree to a certain extent. I have had experience with Spark before. If not absolutely necessary, I don't want to bother setting up a Spark cluster yet. – zyxue Nov 12 '16 at 00:18
  • How many columns are you going to have in the result DF? I.e. how many unique strings (words) do you have approx. in all CSV files? – MaxU - stand with Ukraine Nov 12 '16 at 06:56
  • Let's say tens of millions for now. – zyxue Nov 12 '16 at 07:20
  • @zyxue, in this case I would go for a Hadoop + Spark (DataFrame) solution – MaxU - stand with Ukraine Nov 12 '16 at 08:31
  • Hi @romainjouin, I ended up using Spark, and you were very right, it's worth it. Do you have any further instructions on creating such a matrix? It still doesn't look obvious to me. – zyxue Nov 19 '16 at 21:45
  • Hi @zyxue, glad the idea has been useful :) Did you have a look at the link I provided before: http://stackoverflow.com/questions/24029873/how-to-read-multiple-text-files-into-a-single-rdd ? I am just responding to your question now and haven't had a chance to come back to it seriously, maybe tomorrow ;) – Romain Jouin Nov 19 '16 at 22:42
  • @romainjouin, I've read it. I was thinking of using DataFrames instead of RDDs, but now it seems RDDs are more suitable for the task at hand. I need to do more exploration and will let you know. – zyxue Nov 19 '16 at 22:45
  • @zyxue - as far as I remember there are no DataFrames in Spark, only RDDs. Did you also set up HDFS? The idea is that by putting the files on HDFS you get better disk access. It can seem daunting at first, but HDFS is not that hard to set up. Do you have a cluster? => Re-reading your question, you obviously do ;-) – Romain Jouin Nov 19 '16 at 22:50
  • Spark DataFrames evolved from Spark SQL; if you do a quick search you'll see, and I think they borrow many concepts from Python pandas and R. I am using Google Cloud Storage directly instead of HDFS; the difference at the storage layer is abstracted away, so the API won't be any different. – zyxue Nov 19 '16 at 23:10
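
A rough two-pass sketch of the "union of all rows first, then loop through each file" idea suggested in the comments above (illustrative only, not code from the original thread). It assumes the same file_list.txt layout as the question's code and writes a dense 0/1 matrix to output.csv; memory stays proportional to the feature union rather than to files x features.

import gzip

# one gzipped, space-separated file path per line, as in the question
with open('file_list.txt') as fh:
    files = [line.strip() for line in fh if line.strip()]

# pass 1: collect the union of all feature ids
all_features = set()
for path in files:
    with gzip.open(path, 'rt') as fh:
        for line in fh:
            all_features.add(line.split(None, 1)[0])
all_features = sorted(all_features)

# pass 2: emit one 0/1 row per file, streaming straight to disk
with open('output.csv', 'w') as out:
    out.write('file_name,' + ','.join(all_features) + '\n')
    for path in files:
        with gzip.open(path, 'rt') as fh:
            present = {line.split(None, 1)[0] for line in fh}
        row = ('1' if f in present else '0' for f in all_features)
        out.write(path + ',' + ','.join(row) + '\n')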

1 Answer

# read one file as a Series: index = feature ids, every value mapped to 1
xto1 = lambda x: 1
def read_ex(fn):
    s = pd.read_csv(
        fn, sep=' ', header=None,
        index_col=0, usecols=[0, 1],
        converters={1: xto1},   # ignore the integer column, just mark presence
        names=[None, fn],
        squeeze=True)           # return a Series instead of a one-column DataFrame
    return s

fs = ['example1.csv', 'example2.csv']

# stack the per-file Series with the file name as the outer index level, then
# unstack the feature level into columns, filling missing features with 0
pd.concat([read_ex(f) for f in fs], keys=fs).unstack(fill_value=0)

[screenshot of the resulting 0/1 DataFrame: rows example1.csv and example2.csv, columns f1 through f7]
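
As a small usage note (not part of the original answer): the result is a regular DataFrame, so it can be written out in the format the question asks for; output.csv is just an assumed file name.

result = pd.concat([read_ex(f) for f in fs], keys=fs).unstack(fill_value=0)
result.index.name = 'file_name'
result.to_csv('output.csv')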

  • Are you sure this will scale? `[read_ex(f) for f in fs]` looks tricky to me. – zyxue Nov 12 '16 at 00:10
  • I will give it a try – zyxue Nov 12 '16 at 00:23
  • honestly, I'm not sure. Try with a couple of files first. – piRSquared Nov 12 '16 at 00:24
  • It probably won't work, as `[read_ex(f) for f in fs]` reads in too much redundant information and will blow up the memory. By a rough calculation, that list comprehension may consume 3 TB of memory and take 19 hours. – zyxue Nov 12 '16 at 00:53
  • Blah... I'll update the post if I think of anything else. – piRSquared Nov 12 '16 at 00:54
  • I think your approach works, but instead of reading all the files in one go, do it in several rounds; it's like building a tree from the bottom up, creating a number of intermediary nodes, though ultimately we are only interested in the root node. I will try this approach out later. – zyxue Nov 12 '16 at 08:18
  • @MaxU what about `bool` or sparse? – piRSquared Nov 12 '16 at 13:50
  • @piRSquared, `bool` will take the same 1 byte, but it's not that convenient. Sparse is a good idea - but how are you going to use it in this case? Sparse would be the best option... (see the sketch below for one way) – MaxU - stand with Ukraine Nov 12 '16 at 13:53
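
To illustrate the sparse option discussed in the last comments, here is a rough sketch (illustrative only, not from the thread) that stores only the 1-entries of the file x feature matrix in a scipy COO matrix; file_list.txt is carried over from the question and the int8 dtype is an assumption.

import gzip
from scipy import sparse

with open('file_list.txt') as fh:
    files = [line.strip() for line in fh if line.strip()]

feature_col = {}      # feature_id -> column number, assigned on first sight
rows, cols = [], []   # coordinates of the 1-entries

for file_id, path in enumerate(files):
    seen = set()
    with gzip.open(path, 'rt') as fh:
        for line in fh:
            feat = line.split(None, 1)[0]
            if feat not in seen:
                seen.add(feat)
                rows.append(file_id)
                cols.append(feature_col.setdefault(feat, len(feature_col)))

# one explicit 1 per (file, feature) pair; all other entries are implicit zeros
mat = sparse.coo_matrix(
    ([1] * len(rows), (rows, cols)),
    shape=(len(files), len(feature_col)),
    dtype='int8',
)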