I have a 3 GB CSV file. I'm trying to save it in HDF5 format with pandas so I can load it faster later.
import pandas as pd
import traceback

df_all = pd.read_csv('file_csv.csv', iterator=True, chunksize=20000)
for _i, df in enumerate(df_all):
    try:
        print('Saving chunk %d...' % _i, end='')
        df.to_hdf('file_csv.hdf',
                  'file_csv',
                  format='table',
                  data_columns=True)
        print('Done!')
    except:
        traceback.print_exc()
        print(df)
        print(df.info())
del df_all
The original CSV file has about 3.3 million rows, which matches the output of this code. The last line of output is: Saving chunk 167...Done! Since enumerate is zero-based, 168 chunks were written: 167 × 20,000 = 3,340,000 rows plus a final partial chunk.
My issue is:
df_hdf = pd.read_hdf('file_csv.hdf')
df_hdf.count()
This reports only 4,613 rows.
And:
item_info = pd.read_hdf('ItemInfo_train.hdf', where="item=1")
This returns nothing, even though I'm sure the "item" column has an entry equal to 1 in the original file.
What could be wrong?
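To narrow it down, here is a minimal self-contained repro of the same write pattern I use in my loop, on tiny hypothetical data (the `repro.hdf` path and `item` values are made up for illustration). It calls `to_hdf(..., format='table', data_columns=True)` twice against the same key, exactly like my chunk loop, and then counts the rows that survive:

```python
import pandas as pd

# Two small "chunks", standing in for the chunks from read_csv
chunk1 = pd.DataFrame({'item': [1, 2, 3]})
chunk2 = pd.DataFrame({'item': [4, 5]})

path = 'repro.hdf'
# Same call pattern as my loop: no explicit append flag
for chunk in (chunk1, chunk2):
    chunk.to_hdf(path, 'repro', format='table', data_columns=True)

df = pd.read_hdf(path, 'repro')
print(len(df))
```

When I run this, it prints 2 rather than 5: only the last chunk's rows are in the file, which matches the symptom (4,613 rows surviving from the final partial chunk). So the question seems to be how my loop should be writing so that chunks accumulate instead of replacing each other.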