
There is lots of information on how to read a CSV into a pandas DataFrame, but what I have is a PyTables table and I want a pandas DataFrame.

I've found how to store my pandas DataFrame to PyTables... then when I read it back, at that point it will have:

"kind = v._v_attrs.pandas_type"  

I could write it out as csv and re-read it in but that seems silly. It is what I am doing for now.

How should I be reading pytable objects into pandas?

Andy Hayden
Jim Knoll

2 Answers

import tables as pt
import pandas as pd
import numpy as np

# the content is junk but we don't care
# note: the dtype spec must be a list of tuples, not a tuple of tuples
grades = np.empty((10,), dtype=[('name', 'S20'), ('grade', 'u2')])

# write to a PyTables table
handle = pt.open_file('/tmp/test_pandas.h5', 'w')
handle.create_table('/', 'grades', grades)
print(handle.root.grades[:].dtype)  # it is a structured array

# load back as a DataFrame and check types
df = pd.DataFrame.from_records(handle.root.grades[:])
print(df.dtypes)
handle.close()

Beware that your u2 (unsigned 2-byte integer) may end up as an i8 (8-byte integer), and the strings as objects, because pandas (at the time of writing) does not support the full range of dtypes that are available for NumPy arrays.
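As a quick check under recent pandas and NumPy versions (where `from_records` preserves more of the original dtypes than older releases did), a minimal sketch:

```python
import numpy as np
import pandas as pd

# structured array analogous to what slicing a PyTables table returns
grades = np.empty((3,), dtype=[('name', 'S20'), ('grade', 'u2')])

df = pd.DataFrame.from_records(grades)

# the numeric column keeps an integer dtype; fixed-width bytes become object
print(df.dtypes)
assert df['grade'].dtype.kind in ('u', 'i')
assert df['name'].dtype == object
```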

meteore
  • thanks, but how does this read data from a non-pandas h5 file into a pandas h5 file? It looks like it just puts random data into a pandas h5 file. I can read my source table like this: 'for rec in table:', but the table is not a pandas h5 file, it is just a PyTables table, so it fails as a pandas source because 'kind' is not 'pandas_type'. – Jim Knoll Oct 17 '12 at 13:53
  • Wait, I spent some more time with this... are you saying all I need to do is add a structured array with extra data type info to my existing PyTables table and then it will import to a pandas df? I really only know how to work with PyTables... it keeps data type info in attributes on the leaf object. If I have this correct, how does pandas associate the two leaf objects (one with data type info, one with the table of data)? – Jim Knoll Oct 17 '12 at 14:42
  • `import numpy as np; grades = np.empty((10,2), dtype=(('name', 'S20'), ('grade', 'u2')))` must be a bug; Python does not understand the code – Jim Knoll Oct 17 '12 at 15:23
  • Sorry, you're right: you have to use a list (`[]`) to group the dtype specification, not a tuple (`()`). – meteore Oct 19 '12 at 08:45
  • As to your other questions, I have trouble understanding what you want. I understand the original post as 'I have a PyTables table and I want a Pandas DataFrame with the correct types'. The answer shows that there's no messing with the _v_attrs to do, since PyTables tables load to record arrays whose dtype specifications are understood by Pandas, even if later Pandas only supports 8-byte integers, 8-byte floats, and objects, instead of the full wealth of [numpy dtypes](http://docs.scipy.org/doc/numpy/user/basics.types.html) – meteore Oct 19 '12 at 08:48
  • I was hoping to get this to work, but a PyTables `where()` does not provide a len: `df = pd.DataFrame.from_records(handle.root.my_table.where('my_field == "some string"'))` – Jim Knoll Nov 15 '12 at 20:32
  • TypeError: object of type 'tables.tableExtension.Row' has no len() – Jim Knoll Nov 15 '12 at 20:38
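For the `where()` issue raised in the comments: `Table.where()` yields `Row` objects, which `from_records` cannot consume directly, but `Table.read_where()` evaluates the same kind of condition and returns a structured array that it can. A sketch with a made-up table (the field names and values are illustrative, not from the original post):

```python
import os
import tempfile
import numpy as np
import pandas as pd
import tables as pt

# build a small table to query (names/values are hypothetical)
recs = np.array([(b'alice', 90), (b'bob', 40), (b'carol', 75)],
                dtype=[('name', 'S20'), ('grade', 'u2')])

path = os.path.join(tempfile.mkdtemp(), 'test.h5')
handle = pt.open_file(path, 'w')
table = handle.create_table('/', 'grades', recs)

# read_where returns a structured array (unlike the Row iterator
# from where()), so from_records can build a DataFrame from it
df = pd.DataFrame.from_records(table.read_where('grade > 50'))
handle.close()

print(df)
assert len(df) == 2
```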

The docs now include an excellent section on using the HDF5 store and there are some more advanced strategies discussed in the cookbook.

It's now relatively straightforward (after `from pandas import HDFStore, DataFrame`):

In [1]: store = HDFStore('store.h5')

In [2]: print(store)
<class 'pandas.io.pytables.HDFStore'>
File path: store.h5
Empty

In [3]: df = DataFrame([[1, 2], [3, 4]], columns=['A', 'B'])

In [4]: store['df'] = df

In [5]: store
<class 'pandas.io.pytables.HDFStore'>
File path: store.h5
/df            frame        (shape->[2,2])

And to retrieve from HDF5/pytables:

In [6]: store['df']  # store.get('df') is an equivalent
Out[6]:
   A  B
0  1  2
1  3  4

You can also query within a table.

Andy Hayden