
Is there a way to force pandas to write an empty DataFrame to an HDF file?

import pandas as pd
df = pd.DataFrame(columns=['x','y'])
df.to_hdf('temp.h5', 'xxx')
df2 = pd.read_hdf('temp.h5', 'xxx') 

Output:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File ".../Python-3.6.3/lib/python3.6/site-packages/pandas/io/pytables.py", line 389, in read_hdf
    return store.select(key, auto_close=auto_close, **kwargs)
  File ".../Python-3.6.3/lib/python3.6/site-packages/pandas/io/pytables.py", line 740, in select
    return it.get_result()
  File ".../Python-3.6.3/lib/python3.6/site-packages/pandas/io/pytables.py", line 1518, in get_result
    results = self.func(self.start, self.stop, where)
  File ".../Python-3.6.3/lib/python3.6/site-packages/pandas/io/pytables.py", line 733, in func
    columns=columns)
  File ".../Python-3.6.3/lib/python3.6/site-packages/pandas/io/pytables.py", line 2986, in read
    idx=i), start=_start, stop=_stop)
  File ".../Python-3.6.3/lib/python3.6/site-packages/pandas/io/pytables.py", line 2575, in read_index
    _, index = self.read_index_node(getattr(self.group, key), **kwargs)
  File ".../Python-3.6.3/lib/python3.6/site-packages/pandas/io/pytables.py", line 2676, in read_index_node
    data = node[start:stop]
  File ".../Python-3.6.3/lib/python3.6/site-packages/tables/vlarray.py", line 675, in __getitem__
    return self.read(start, stop, step)
  File ".../Python-3.6.3/lib/python3.6/site-packages/tables/vlarray.py", line 811, in read
    listarr = self._read_array(start, stop, step)
  File "tables/hdf5extension.pyx", line 2106, in tables.hdf5extension.VLArray._read_array (tables/hdf5extension.c:24649)
ValueError: cannot set WRITEABLE flag to True of this array

Writing with format='table':

import pandas as pd
df = pd.DataFrame(columns=['x','y'])
df.to_hdf('temp.h5', 'xxx', format='table')
df2 = pd.read_hdf('temp.h5', 'xxx')

Output:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File ".../Python-3.6.3/lib/python3.6/site-packages/pandas/io/pytables.py", line 389, in read_hdf
    return store.select(key, auto_close=auto_close, **kwargs)
  File ".../Python-3.6.3/lib/python3.6/site-packages/pandas/io/pytables.py", line 722, in select
    raise KeyError('No object named {key} in the file'.format(key=key))
KeyError: 'No object named xxx in the file'

Pandas version: 0.24.2

Thank you for your help!

S.V
  • I don't use `pandas`, so working from their docs. Their examples all use `key='xxx'` to define the group identifier. It's not clear if the value can also be taken as a positional argument. – kcw78 Mar 25 '19 at 14:33
  • I have just tried to use 'yyy' as a key with the same results. So, it does not look like the problem is in the key name. – S.V Mar 26 '19 at 13:03
  • It seems to be a current "feature" of pandas: empty DataFrames are intentionally not written to HDF files: [pandas issue #13016](https://github.com/pandas-dev/pandas/issues/13016) – S.V Mar 26 '19 at 13:03
  • What are you trying to accomplish writing an empty DataFrame to HDF5? Maybe there's another way to do it (natively with pytables)? I saw a patch is mentioned in the pandas issue (pytables.py#L1365). Have you considered that? – kcw78 Mar 26 '19 at 14:04
  • @kcw78 My data sets are very large (some are dozens of TB), so I partition my data. However, some partitions might end up having no data of a specific type, which results in an empty DataFrame being written to a file (i.e. in a failed attempt to write an empty DataFrame). When I read such a partition, my code crashes since the corresponding key is not found in the file. Of course, I could insert safeguards everywhere to check for the presence of a key in a partition before it is read (a minimal version of such a guard is sketched after these comments), but it would make my life easier if I could just write/read empty DataFrames to/from partitions. – S.V Apr 04 '19 at 16:40
  • @kcw78 The comment about patching pytables.py#L1365 at the end of the pandas issue page is from me. In general, I would prefer not to have to patch pandas code, since it would make upgrading pandas and verifying that my patches do not break anything much more difficult and error-prone. For now, I prefer to work around this pandas "feature". – S.V Apr 04 '19 at 16:49
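
For reference, the key-presence safeguard mentioned in the comments could look roughly like the sketch below. This is only an illustration, not code from the question: the file name, key and column list are placeholders, and it relies on HDFStore supporting the `in` operator for key lookup.

import pandas as pd

def read_partition(path, key, columns):
    # Only select the key if it was actually written to this partition;
    # otherwise return an empty DataFrame with the expected columns.
    with pd.HDFStore(path, mode='r') as store:
        if key in store:
            return store.select(key)
    return pd.DataFrame(columns=columns)

df = read_partition('partition.h5', 'xxx', columns=['x', 'y'])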

1 Answer


Putting an empty DataFrame into an HDFStore in fixed format should work (you may need to check the versions of the other packages involved, e.g. tables):

# Versions
import pandas as pd
import tables

pd.__version__
tables.__version__

# Empty DataFrame
df = pd.DataFrame(columns=['x','y'])
df

# Dump in fixed format, then read it back from the same store
with pd.HDFStore('temp.h5') as store:
    store.put('df', df, format='fixed')   # 'f' is accepted as an alias for 'fixed'
    print('Read:')
    store.select('df')

>>> '0.24.2'
>>> '3.5.1'
>>>   x     y
>>>
>>> Read:
>>>   x     y

PyTables itself really forbids this (or at least it used to), but for the fixed format pandas has its own workaround.

As discussed in the same GitHub issue, some effort has been made to fix this behavior for the table format as well, but the solution still seems to be up in the air (at least it was at the end of March).
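
Until that fix lands, one possible workaround for the table format is to skip writing empty frames and reconstruct them on read from a known column list. This is just a sketch under that assumption; the helper names are made up, and it only uses the standard to_hdf/read_hdf calls.

import pandas as pd

def to_hdf_safe(df, path, key):
    # The table format silently skips empty frames, so only write when there is data.
    if not df.empty:
        df.to_hdf(path, key, format='table')

def read_hdf_safe(path, key, columns):
    try:
        return pd.read_hdf(path, key)
    except (KeyError, FileNotFoundError):
        # A missing key (or file) means the partition was empty when written.
        return pd.DataFrame(columns=columns)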

Xronx