I am new to data analysis and Python and have virtually no experience with NumPy, PyTables, pandas, etc.
I am reading a csv file into a DataFrame chunk by chunk and appending it to an HDFStore, as the entire dataset cannot fit in memory. In my append operations I pass a list of data columns via the data_columns parameter so that I can perform out-of-core filtering with HDFStore.select, passing conditions to the where parameter.
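For context, this is roughly what the filtering step looks like afterwards (a sketch only; 'DC1', 'DC2' and the condition values are placeholders for the real data columns and conditions):

from pandas import HDFStore

store = HDFStore('FLN2.h5')
# Only rows matching the condition are read from disk (out-of-core filtering)
subset = store.select('df', where="DC1 == 'ABC' & DC2 > 100")
store.close()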
Is it possible to specify the data columns after the table has been appended?
The reason is that the time to perform chunk-by-chunk appends to HDFStore tables increases exponentially with each write. I am guessing that each time an append is made with new data, the index on the data columns is recalculated. It would be much more efficient if the data were placed entirely in the store first and the data columns specified after creation, so that the index is calculated just once.
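To make the idea concrete, the closest pattern I have found is sketched below: append with index=False so no index maintenance happens per chunk, then build the index once at the end with HDFStore.create_table_index. As far as I can tell the data_columns still have to be declared on the first append, which is why I am asking whether they can be specified afterwards instead:

from pandas import HDFStore, read_csv

f = HDFStore('FLN2.h5')
for num, chunk in enumerate(read_csv('FLN.csv', iterator=True, chunksize=250000)):
    if num == 0:
        # data_columns apparently must still be declared on the first append
        f.append('df', chunk, data_columns=['DC1', 'DC2'], index=False)
    else:
        f.append('df', chunk, index=False)
# Build the index on the data columns in a single pass at the end
f.create_table_index('df', optlevel=9, kind='full')
f.close()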
Appreciate your thoughts and assistance.
UPDATE 1
The following is roughly the code that I am running. Please note that I had to sanitize the code to avoid unnecessary disclosure and have not provided the entire column list in the next few lines of code, as there are > 100 columns:
I have determined the dtypes to be read from the csv and declared them like this:
csv_dt = {'COL1': dtype('O'), 'TAX': dtype('float64'), 'COL2': dtype('O'), ... }
The following is the dtype declaration for the table that is going to be written. Notice that I have added the calculated field 'REVENUE', which is not in the original csv.
t_dt = [('COL1', 'S20'), ('COL2', 'S3'), ('COL3', 'f8'), ('TAX', 'f8'), ('REVENUE','f8') ... ]
Determined the max string length of each column to be passed to min_itemsize. This is only for columns with dtype('O'):
str_len = {'COL1': 3, 'COL2': 3, ...}
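For reference, this is roughly how I computed those lengths (a sketch only; it scans the csv in chunks and uses the csv_dt dict declared above):

from pandas import read_csv

# Track the longest string seen in each object column across all chunks
str_len = {}
for chunk in read_csv('FLN.csv', dtype=csv_dt, iterator=True, chunksize=250000):
    for col in chunk.columns:
        if chunk[col].dtype == object:
            longest = int(chunk[col].astype(str).str.len().max())
            str_len[col] = max(str_len.get(col, 0), longest)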
Here's the code for chunking:
f = HDFStore('FLN2.h5')
for num, chunk in enumerate(read_csv('FLN.csv', dtype=csv_dt, iterator=True, chunksize=250000, parse_dates=True)):
    print str(now()), str(num), "Started new chunk"
    chunk['MTH'] = to_datetime(chunk.MTH)
    chunk.TAX = chunk.TAX.fillna(0)
    chunk['REVENUE'] = chunk.NET - chunk.TAX
    print str(now()), str(num), "Completed Revenue Calculation"
    if num == 0:
        f.append('df', chunk, dtype=t_dt, min_itemsize=str_len, data_columns=['DC1', 'DC2'], expectedrows=42000000)
    else:
        f.append('df', chunk)
    print str(now()), str(num), "Completed writing chunk"
    f.flush()
f.flush()
f.close()
This code was running fine yesterday when the chunksize for reading the csv was set to 50000. However, today when I try to run the code I get the following error:
Exception: tables cannot write this data -> rows parameter cannot be converted into a recarray object compliant with table '/df/table (Table(0,))
For the life of me, I can't figure out what happened.
Output of DataFrame.info():
<class 'pandas.core.frame.DataFrame'>
Int64Index: 250000 entries, 0 to 249999
Columns: 122 entries, COL1 to REVENUE
dtypes: datetime64[ns](1), float64(40), int64(13), object(68)
Note that the above is only from one chunk written to disk.