
I have thousands of CSV files on disk, each approximately 10MB in size (~10K columns). Most of these columns hold real (float) values.

I would like to create a dataframe by concatenating these files. Once I have this dataframe, I would like to sort its entries by the first two columns.

I currently have the following:

import numpy as np
import pandas as pd

my_dfs = []
for path in p_files:
    my_dfs.append(
        pd.read_csv(path, sep=':', dtype={'c1': np.object_, 'c2': np.object_}))

print("Concatenating files ...")
df_merged = pd.concat(my_dfs)

print("Sorting the result by the first two columns...")
df_merged = df_merged.sort_values(['videoID', 'frameID'], ascending=[True, True])

print("Saving it to disk ..")
df_merged.to_csv(p_output, sep=':', index=False)

But this requires so much memory that my process is killed before it finishes (in the logs I can see that the process is killed when it is using around 10GB of memory).

I am trying to figure out where exactly it fails, but I have not been able to yet (although I hope to capture the stdout in a log soon).

Is there a better way to do this in Pandas?

Amelio Vazquez-Reina
  • Thanks @Nurgle I have not used SQLite before. Do you have any pointers on how I could move my problem to it? – Amelio Vazquez-Reina May 18 '13 at 00:52
  • 1000s of files * 10MB ~ 10GB; do you have that much RAM available? A database sort is probably a better fit in this case. – elyase May 18 '13 at 00:52
  • Have you heard of the heapq module and merge sorting? Guido implemented this approach for sorting a large collection of files (a minimal sketch of the idea follows these comments). Check it out here: http://neopythonic.blogspot.com/2008/10/sorting-million-32-bit-integers-in-2mb.html – Bryan May 18 '13 at 00:54
  • @elyase The text data on disk, when concatenated, may take up as much as 10GB, but I understand Pandas converts ASCII floats to float types, and it may have an efficient mechanism to save to disk. – Amelio Vazquez-Reina May 18 '13 at 00:56
  • Here is a similar question: http://stackoverflow.com/questions/16110252/need-to-compare-very-large-files-around-1-5gb-in-python/16110391#16110391 – Jeff May 18 '13 at 01:06
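
A minimal sketch of the external-merge idea Bryan mentions, assuming each input file has already been sorted individually by its first two columns and contains no header row; the file names are hypothetical, while the ':' delimiter and two-column key come from the question:

import csv
import heapq

def rows(path):
    # Lazily yield rows from one ':'-separated file.
    with open(path, newline='') as f:
        for row in csv.reader(f, delimiter=':'):
            yield row

def key(row):
    # Sort key: the first two columns (videoID, frameID in the question).
    return (row[0], row[1])

# Hypothetical list of per-file outputs that were each sorted on their own.
sorted_paths = ['part_0001.csv', 'part_0002.csv']

with open('merged.csv', 'w', newline='') as out:
    writer = csv.writer(out, delimiter=':')
    # heapq.merge keeps only one pending row per input file in memory.
    writer.writerows(heapq.merge(*(rows(p) for p in sorted_paths), key=key))

Note that this keeps one file handle open per input, so with thousands of files the OS open-file limit may need to be raised.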

1 Answer


Loading them into a database is easy, flexible for making changes later on and takes advantage of all the optimization work that goes into databases. Once you've loaded it, if you wanted to get an iterable of the data, you could run the following query and be done:

SELECT * FROM my_table ORDER BY column1, column2

I'm pretty sure there are more direct ways to load CSV data into sqlite3 from within sqlite itself, but if you don't want to do it directly in sqlite, you can use Python to load the data, taking advantage of the csv reader as an iterator so that only a minimal amount is held in memory at a time, as in:

import csv
import sqlite3

conn = sqlite3.connect(dbpath)
c = conn.cursor()

for path in paths:
    with open(path, newline='') as f:
        reader = csv.reader(f, delimiter=':')  # the question's files are ':'-separated
        next(reader)  # skip the header row
        # One "?" placeholder per column; adjust to match the actual column count.
        c.executemany("INSERT INTO mytable VALUES (?,?,?)", reader)

conn.commit()

That way, you never have to load too much into memory at once, and you can take advantage of sqlite.
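
One gap worth noting: the INSERT above assumes mytable already exists. A minimal sketch of creating it from the header row of the first file (the column quoting and the untyped column definitions are my assumptions, not part of the original answer):

import csv
import sqlite3

conn = sqlite3.connect(dbpath)
c = conn.cursor()

# Build a table whose columns match the header of the first input file.
with open(paths[0], newline='') as f:
    header = next(csv.reader(f, delimiter=':'))

columns = ', '.join('"%s"' % name for name in header)
c.execute('CREATE TABLE IF NOT EXISTS mytable (%s)' % columns)
conn.commit()

Keep in mind that sqlite caps the number of columns per table (2000 by default), which matters with ~10K columns per file.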

After that (if you wanted to do it in Python again) you could do:

import csv
import sqlite3

conn = sqlite3.connect(dbpath)
c = conn.cursor()

# Stream the sorted rows straight from sqlite to the output file.
with open(outpath, 'w', newline='') as f:
    writer = csv.writer(f, delimiter=':')
    writer.writerows(c.execute("SELECT * FROM mytable ORDER BY col1, col2"))
Jeff Tratner
  • Thank you Jeff. This is extremely helpful. I am wondering, is there a way to do this with HDF5? Does PyTables/HDF5 support this type of access? (i.e. select with sorting) – Amelio Vazquez-Reina May 19 '13 at 17:01
  • @user815423426 you can look at how [querying hdf5 in pandas works](http://pandas.pydata.org/pandas-docs/dev/io.html#querying-a-table), but I think you have to load the entire dataframe (or a subset of it) into memory before you can sort/change the table. You can [load from an sqlite/sql database into pandas](http://pandas.pydata.org/pandas-docs/dev/io.html#sql-queries) too (see the sketch after this comment). – Jeff Tratner May 19 '13 at 19:50
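
For that last point, a minimal sketch of pulling the sorted result from sqlite back into pandas without holding everything in memory at once; the chunksize value and the append-mode output handling are my choices, not from the comment:

import sqlite3
import pandas as pd

conn = sqlite3.connect(dbpath)

# With chunksize, read_sql_query returns an iterator of DataFrames,
# so the sorted result is streamed to disk rather than held all at once.
chunks = pd.read_sql_query(
    "SELECT * FROM mytable ORDER BY col1, col2", conn, chunksize=100000)

for i, chunk in enumerate(chunks):
    chunk.to_csv(outpath, sep=':', index=False,
                 mode='w' if i == 0 else 'a', header=(i == 0))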