1

I have a CSV file with 3GB. I'm trying to save it to HDF format with Pandas so I can load it faster.

import pandas as pd
import traceback

df_all = pd.read_csv('file_csv.csv', iterator=True, chunksize=20000)

for _i, df in enumerate(df_all):
    try:
        print ('Saving %d chunk...' % _i, end='')
        df.to_hdf('file_csv.hdf',
                  'file_csv',
                  format='table',
                  data_columns=True)
        print ('Done!')
    except:
        traceback.print_exc()
        print (df)
        print (df.info())

del df_all

The original CSV file has about 3 million rows, which is reflected by the output of this piece of code. The last line of output is: Saving 167 chunk...Done! That means: 167*20000 = 3.340.000 rows

My issue is:

df_hdf = pd.read_hdf('file_csv.hdf')
df_hdf.count()

=> 4613 rows

And:

item_info = pd.read_hdf('ItemInfo_train.hdf', where="item=1")

Returns nothing, even I'm sure the "item" column has an entry equals to 1 in the original file.

What can be wrong?

Frias
  • 10,991
  • 9
  • 33
  • 40

1 Answers1

0

Use append=True to tell to_hdf to append new chunks to the same file.

    df.to_hdf('file_csv.hdf', ..., append=True)

Otherwise, each call overwrites the previous contents and only the last chunk remains saved in file_csv.hdf.


import os
import numpy as np
import pandas as pd

np.random.seed(2016)
df = pd.DataFrame(np.random.randint(10, size=(100, 2)), columns=list('AB'))
df.to_csv('file_csv.csv')
if os.path.exists('file_csv.hdf'): os.unlink('file_csv.hdf')
for i, df in enumerate(pd.read_csv('file_csv.csv', chunksize=50)):
    print('Saving {} chunk...'.format(i), end='')
    df.to_hdf('file_csv.hdf',
              'file_csv', 
              format='table',
              data_columns=True,
              append=True)
    print('Done!')
    print(df.loc[df['A']==1])
print('-'*80)
df_hdf = pd.read_hdf('file_csv.hdf', where="A=1")
print(df_hdf)

prints

    Unnamed: 0  A  B
22          22  1  7
30          30  1  7
41          41  1  9
44          44  1  0
19          69  1  3
29          79  1  1
31          81  1  5
34          84  1  6

Use append=True to tell to_hdf to append new chunks to the same file. Otherwise, only the last chunk is saved in file_csv.hdf:

import os
import numpy as np
import pandas as pd

np.random.seed(2016)
df = pd.DataFrame(np.random.randint(10, size=(100, 2)), columns=list('AB'))
df.to_csv('file_csv.csv')
if os.path.exists('file_csv.hdf'): os.unlink('file_csv.hdf')
for i, df in enumerate(pd.read_csv('file_csv.csv', chunksize=50)):
    print('Saving {} chunk...'.format(i), end='')
    df.to_hdf('file_csv.hdf',
              'file_csv', 
              format='table',
              data_columns=True,
              append=True)
    print('Done!')
    print(df.loc[df['A']==1])
print('-'*80)
df_hdf = pd.read_hdf('file_csv.hdf', where="A=1")
print(df_hdf)

prints

    Unnamed: 0  A  B
22          22  1  7
30          30  1  7
41          41  1  9
44          44  1  0
19          69  1  3
29          79  1  1
31          81  1  5
34          84  1  6
unutbu
  • 842,883
  • 184
  • 1,785
  • 1,677