
I am trying to select random rows from an HDFStore table of about 1 GB. RAM usage explodes when I ask for about 50 random rows.

I am using pandas 0.11-dev, Python 2.7, on 64-bit Linux.

In this first case, RAM usage fits the size of the chunk:

with pd.get_store("train.h5", 'r') as train:
    for chunk in train.select('train', chunksize=50):
        pass

In this second case, it seems like the whole table is loaded into RAM:

r = random.choice(400000, size=40, replace=False)
train.select('train', pd.Term("index", r))

In this last case, RAM usage fits the equivalent chunk size:

r = random.choice(400000, size=30, replace=False)
train.select('train', pd.Term("index", r))

I am puzzled as to why moving from 30 to 40 random rows induces such a dramatic increase in RAM usage.

Note that the table was indexed at creation so that index=range(nrows(table)), using the following code:

def txtfile2hdfstore(infile, storefile, table_name, sep="\t", header=0, chunksize=50000):
    max_len, dtypes0 = txtfile2dtypes(infile, sep, header, chunksize)

    with pd.get_store(storefile, 'w') as store:
        for i, chunk in enumerate(pd.read_table(infile, header=header, sep=sep, chunksize=chunksize, dtype=dict(dtypes0))):
            chunk.index = range(chunksize * i, chunksize * (i + 1))[:chunk.shape[0]]
            store.append(table_name, chunk, min_itemsize={'values': max_len})

Thanks for the insight.

EDIT TO ANSWER Zelazny7

Here's the file I used to write Train.csv to train.h5. I wrote it using elements of Zelazny7's code from "How to trouble-shoot HDFStore Exception: cannot find the correct atom type".

import pandas as pd
import numpy as np
from sklearn.feature_extraction import DictVectorizer


def object_max_len(x):
    if x.dtype != 'object':
        return
    else:
        return len(max(x.fillna(''), key=lambda x: len(str(x))))

def txtfile2dtypes(infile, sep="\t", header=0, chunksize=50000):
    max_len = pd.read_table(infile, header=header, sep=sep, nrows=5).apply(object_max_len).max()
    dtypes0 = pd.read_table(infile, header=header, sep=sep, nrows=5).dtypes

    for chunk in pd.read_table(infile, header=header, sep=sep, chunksize=chunksize):
        max_len = max((pd.DataFrame(chunk.apply(object_max_len)).max(), max_len))
        for i, k in enumerate(zip(dtypes0[:], chunk.dtypes)):
            if (k[0] != k[1]) and (k[1] == 'object'):
                dtypes0[i] = k[1]
    # as of pandas 0.11, NaN requires a float64 dtype
    dtypes0.values[dtypes0 == np.int64] = np.dtype('float64')
    return max_len, dtypes0


def txtfile2hdfstore(infile, storefile, table_name, sep="\t", header=0, chunksize=50000):
    max_len, dtypes0 = txtfile2dtypes(infile, sep, header, chunksize)

    with pd.get_store(storefile, 'w') as store:
        for i, chunk in enumerate(pd.read_table(infile, header=header, sep=sep, chunksize=chunksize, dtype=dict(dtypes0))):
            chunk.index = range(chunksize * i, chunksize * (i + 1))[:chunk.shape[0]]
            store.append(table_name, chunk, min_itemsize={'values': max_len})

Applied as

txtfile2hdfstore('Train.csv','train.h5','train',sep=',')
user17375
  • You seem to be using HDFStore in a similar fashion to how I would like to use it. I have not had the time to create the wrapper code that handles a lot of the storing and retrieving. Would you mind sharing your `txtfile2dtypes` code? Also, does your data have a lot of character data? I run into issues when storing a csv file into an HDFStore with variable character data. The file-size blows up because I have to set `min_itemsize` to such a large value. I eagerly await the addition of a `truncate` option. – Zelazny7 Apr 09 '13 at 13:54
  • @Zelazny7 I have updated the thread with the code. Actually I am using it on the same data as yours, the Kaggle bulldozer competition. I haven't dummified the categorical variables to use `sklearn` yet. – user17375 Apr 09 '13 at 14:44
  • Thank you very much! It looks like your file size balloons in the same way mine does. The ~120 MB file ends up over 1 GB. I wonder if you or Jeff would know whether it's better to store variable-length `object` columns (really just strings) using `put`, keeping each text column as its own HDFStore object. – Zelazny7 Apr 09 '13 at 14:53
  • Actually, before you can use your data with an ML algorithm, you will have to binarize the categorical variables. So at the end of the day, you won't care about how large your strings were as long as your HDD can take them. – user17375 Apr 09 '13 at 15:14
  • My use-case is a bit more hands-on. I like to explore the character data before processing it for the modeling step. And there are other ways to handle nominal data, for example by recoding to the target mean. – Zelazny7 Apr 09 '13 at 15:35
  • Just wanted to give you an update. After much testing I've determined HDFStore is just not going to work. I'm looking for a column-oriented table structure and HDFStore is, unfortunately, not that. So, I'm throwing in with R and going to use the ff package. If you're open to R it might work for you as well: http://cran.r-project.org/web/packages/ff/index.html cheers – Zelazny7 Apr 12 '13 at 14:18
  • Thanks for sharing! If you have a Kaggle login you can get R-Revolution for free. – user17375 Apr 15 '13 at 18:30
  • @Zelazny7 There is an interesting [thread on Quora](https://www.quora.com/Python-programming-language-1/Why-is-Python-a-language-of-choice-for-data-scientists) that clearly points out Python shortcomings when it comes to large dataset. Check Jeff Hammerbacher's answer. – user17375 Apr 21 '13 at 06:31
  • Thanks for the link. I agree entirely. I've been using R with the ff package for a couple weeks now and am starting to develop a comfortable workflow for working with large data. Also, Rstudio makes a pretty decent replacement for iPython in the "R" world. It has many of the same features. – Zelazny7 Apr 21 '13 at 15:05
  • @Zelazny7 coursera is offering an interesting course that seems to discuss [big data in the database perspective](https://www.coursera.org/course/datasci). you might be interested – user17375 May 17 '13 at 06:33

1 Answer


This is a known issue; see the reference here: https://github.com/pydata/pandas/pull/2755

Essentially the query is turned into a numexpr expression for evaluation. There is an issue where I can't pass a lot of `or` conditions to numexpr (it's dependent on the total length of the generated expression).
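
Roughly speaking, that means a selection like pd.Term("index", r) ends up as one comparison per requested value, all chained together with or. The snippet below is only an illustration of why the generated expression grows with the number of rows requested, not pandas' actual internals:

import numpy as np

r = np.random.choice(400000, size=40, replace=False)
# illustration only: one "index == value" clause per requested row,
# joined with "|"; the string length grows linearly with len(r)
expr = " | ".join("(index == %d)" % v for v in r)
print(len(expr))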

So I just limit the expression that we pass to numexpr. If it exceeds a certain number of `or` conditions, the query is done as a filter rather than an in-kernel selection. Basically this means the table is read and then reindexed.
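
In other words, once that limit is exceeded the selection effectively behaves like the conceptual sketch below (not the literal internal code), which is why RAM usage jumps to the size of the whole table:

# conceptual sketch only: the "filter" path reads the full table into RAM,
# then picks out the requested rows by reindexing
df = train.select('train')   # entire ~1 GB table comes into memory
result = df.reindex(r)       # reduced to the requested rows afterwards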

This is on my enhancements list: https://github.com/pydata/pandas/issues/2391 (17).

As a workaround, just split your query into multiple smaller ones and concat the results. This should be much faster and use a constant amount of memory.
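
For example, something along these lines (a minimal sketch reusing the calls from the question; the batch size is arbitrary, anything small enough to stay in-kernel should work):

import numpy as np
import pandas as pd

r = np.random.choice(400000, size=40, replace=False)

with pd.get_store("train.h5", 'r') as train:
    # one small in-kernel query per batch of indices, then stitch the pieces together
    pieces = [train.select('train', pd.Term("index", batch))
              for batch in np.array_split(r, 2)]
    sample = pd.concat(pieces)
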

Jeff
  • OK, thanks. I missed the issue thread; I should have searched the GitHub issues first. By the way, I just realized you are the HDFStore developer, so thanks for the great work! – user17375 Apr 09 '13 at 12:59
  • This is pretty obscure and unfortunately easy to miss :) – Jeff Apr 09 '13 at 13:20