I am new to data analysis and Python and have virtually no experience with NumPy, PyTables, pandas, etc.
I am reading a csv file into a DataFrame chunk by chunk and appending it to an HDFStore, as the entire dataset cannot fit in memory. In my append operations I pass a list of data columns via the data_columns parameter so that I can perform out-of-core filtering with HDFStore.select, passing conditions to the where parameter.
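For context, this is roughly what the filtering step looks like afterwards (a sketch only; 'DC1', 'DC2' and the condition values are placeholders for the real data columns and conditions):

from pandas import HDFStore

store = HDFStore('FLN2.h5')
# Only rows matching the condition are read from disk (out-of-core filtering)
subset = store.select('df', where="DC1 == 'ABC' & DC2 > 100")
store.close()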
Is it possible to specify the data columns after the table has been appended?
The reason is that the time to perform chunk-by-chunk appends to HDFStore tables increases exponentially with each write. I am guessing that each time an append is made with new data, the index on the data columns is recalculated. It would be much more efficient if the data were placed entirely in the store first and the data columns specified after creation, so that the index is calculated just once.
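To make the idea concrete, the closest pattern I have found is sketched below: append with index=False so no index maintenance happens per chunk, then build the index once at the end with HDFStore.create_table_index. As far as I can tell the data_columns still have to be declared on the first append, which is why I am asking whether they can be specified afterwards instead:

from pandas import HDFStore, read_csv

f = HDFStore('FLN2.h5')
for num, chunk in enumerate(read_csv('FLN.csv', iterator=True, chunksize=250000)):
    if num == 0:
        # data_columns apparently must still be declared on the first append
        f.append('df', chunk, data_columns=['DC1', 'DC2'], index=False)
    else:
        f.append('df', chunk, index=False)
# Build the index on the data columns in a single pass at the end
f.create_table_index('df', optlevel=9, kind='full')
f.close()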
Appreciate your thoughts and assistance.
UPDATE 1
The following is roughly the code that I am running. Please note that I had to sanitize the code to avoid unnecessary disclosure and have not provided the entire column list in the next few lines of code, as there are > 100 columns:
I have determined the dtypes to be read from the csv and declared them like this:
csv_dt = {'COL1': dtype('O'), 'TAX': dtype('float64'), 'COL2': dtype('O'), ... }
The following is the dtype declaration for the table that is going to be written. Notice that I have added the calculated field 'REVENUE', which is not in the original csv.
t_dt = [('COL1', 'S20'), ('COL2', 'S3'), ('COL3', 'f8'), ('TAX', 'f8'), ('REVENUE','f8') ... ]
Determined the max string length of each column to be passed to min_itemsize. This is only for columns with dtype('O'):
str_len = {'COL1': 3, 'COL2': 3, ...}
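For reference, this is roughly how I computed those lengths (a sketch only; it scans the csv in chunks and uses the csv_dt dict declared above):

from pandas import read_csv

# Track the longest string seen in each object column across all chunks
str_len = {}
for chunk in read_csv('FLN.csv', dtype=csv_dt, iterator=True, chunksize=250000):
    for col in chunk.columns:
        if chunk[col].dtype == object:
            longest = int(chunk[col].astype(str).str.len().max())
            str_len[col] = max(str_len.get(col, 0), longest)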
Here's the code for chunking:
f = HDFStore('FLN2.h5')
for num, chunk in enumerate(read_csv('FLN.csv', dtype=csv_dt, iterator=True, chunksize=250000, parse_dates=True)):
    print str(now()), str(num), "Started new chunk"
    chunk['MTH'] = to_datetime(chunk.MTH)
    chunk.TAX = chunk.TAX.fillna(0)
    chunk['REVENUE'] = chunk.NET - chunk.TAX
    print str(now()), str(num), "Completed Revenue Calculation"
    if num == 0:
        f.append('df', chunk, dtype=t_dt, min_itemsize=str_len, data_columns=['DC1', 'DC2'], expectedrows=42000000)
    else:
        f.append('df', chunk)
    print str(now()), str(num), "Completed writing chunk"
    f.flush()
f.flush()
f.close()
This code was running fine yesterday when the chunksize for reading the csv was set to 50000. However, today when I try to run the code I get the following error:
Exception: tables cannot write this data -> rows parameter cannot be converted into a recarray object compliant with table '/df/table (Table(0,))
For the life of me, I can't figure out what happened.
Output of DataFrame.info():
<class 'pandas.core.frame.DataFrame'>
Int64Index: 250000 entries, 0 to 249999
Columns: 122 entries, COL1 to REVENUE
dtypes: datetime64[ns](1), float64(40), int64(13), object(68)
Note that the above is only from one chunk written to disk.