1

How can I select some rows by comparing two columns from hdf5 file using Pandas? The hdf5 file is too big to load into memory. For example, I want to select rows where column A and columns B is equal. The dataframe is save in file 'mydata.hdf5'. Thanks.

import pandas as pd
store = pd.HDFstore('mydata.hdf5')
df = store.select('mydf',where='A=B')

This doesn't work. I know that store.select('mydf',where='A==12') will work. But I want to compare column A and B. The example data looks like this:

A B C 
1 1 3
1 2 4
. . .
2 2 5
1 3 3
Lee
  • 11
  • 1
  • Is this a typo: `df = store.select('mydf',where='A=B')`? shouldn't it be: `df = store.select('mydf',where='A==B')` i.e. 2 `==`s? – EdChum Dec 08 '14 at 09:17
  • @EdChum, thanks for your quick reply. But both of 'A=B' and 'A==B' is OK. The error is as below: c:\python33\lib\site-packages\pandas\computation\pytables.py in convert_value(self, v) 184 return TermValue(int(v), v, kind) 185 elif kind == u('integer'): --> 186 v = int(float(v)) 187 return TermValue(v, v, kind) 188 elif kind == u('float'): ValueError: could not convert string to float: 'chr_b' – Lee Dec 08 '14 at 09:27

1 Answers1

1

You cannot directly do this, but the following will work

In [23]: df = DataFrame({'A' : [1,2,3], 'B' : [2,2,2]})

In [24]: store = pd.HDFStore('test.h5',mode='w')

In [26]: store.append('df',df,data_columns=True)

In [27]: store.select('df')
Out[27]: 
   A  B
0  1  2
1  2  2
2  3  2

In [28]: store.select_column('df','A') == store.select_column('df','B')
Out[28]: 
0    False
1     True
2    False
dtype: bool

This should be pretty efficient.

Jeff
  • 125,376
  • 21
  • 220
  • 187
  • This works for the simple case. But if we have more columns to compare, how to solve it? data = store.select('df', store.select_column('df','A') == store.select_column('df','B') & store.select_column('df','C') < store.select_column('df','D') & store.select_column('df','E')==2 ) . This may be not compact and efficient since we call .select_column several times, right? – Lee Dec 08 '14 at 12:27
  • I think comparsion between columns is quite ofen used. But I can't find any related document. Pandas has not implement this function, right? – Lee Dec 08 '14 at 12:32
  • simply read it into memory and do it directly then. This is not implemented by PyTables. So no easy way to do this. – Jeff Dec 08 '14 at 13:00