
I am getting the following error after using pandas.HDFStore().append()

ValueError: Trying to store a string with len [150] in [values_block_0] column but  this column has a limit of [127]!

Consider using min_itemsize to preset the sizes on these columns

I am creating a pandas DataFrame and appending it to the HDF5 file as follows:

import pandas as pd

store = pd.HDFStore("test1.h5", mode='w')

hdf_key = "one_key"

columns = ["col1", "col2", ... ]

df = pd.DataFrame(...)
df.col1 = df.col1.astype(str)
df.col2 = df.col2.astype(int)
df.col3 = df.col3.astype(str)
.... 
store.append(hdf_key, df, data_column=columns, index=False)

I get the error above: "ValueError: Trying to store a string with len [150] in [values_block_0] column but this column has a limit of [127]!"

Afterwards, I execute the code:

store.get_storer(hdf_key).table.description

which outputs

{
  "index": Int64Col(shape=(), dflt=0, pos=0),
  "values_block_0": StringCol(itemsize=127, shape=(5,), dflt=b'', pos=1),
  "values_block_1": Int64Col(shape=(5,), dflt=0, pos=2),
  "col1": StringCol(itemsize=20, shape=(), dflt=b'', pos=3),
  "col2": StringCol(itemsize=39, shape=(), dflt=b'', pos=4)}

What are values_block_0 and values_block_1?

So, following the StackOverflow answer "Pandas pytable: how to specify min_itemsize of the elements of a MultiIndex", I tried

store.append(hdf_key, df, data_column=columns, index=False,  min_itemsize={"values_block_0":250})

This doesn't work, though; now I get this error:

ValueError: Trying to store a string with len [250] in [values_block_0] column but  this column has a limit of [127]!

Consider using min_itemsize to preset the sizes on these columns

What am I doing wrong?

EDIT: Running filename.py with the code below produces the error ValueError: min_itemsize has the key [values_block_0] which is not an axis or data_column

import pandas as pd
store = pd.HDFStore("test1.h5", mode='w')
hdf_key = "one_key"

my_columns = ["col1", "col2", ... ]

df = pd.DataFrame(...)
df.col1 = df.col1.astype(str)
df.col2 = df.col2.astype(int)
df.col3 = df.col3.astype(str)
.... 
store.append(hdf_key, df, data_column=my_columns, index=False, min_itemsize={"values_block_0":350})

Here is the full error:

(python-3) -bash:1008 $ python filename.py
Traceback (most recent call last):
  File "filename.py", line 50, in <module>
    store.append(hdf_key, dicts_into_df,  data_column=my_columns, index=False, min_itemsize={'values_block_0':350})
  File "/path/lib/python-3/lib/python3.5/site-packages/pandas/io/pytables.py", line 970, in append
    **kwargs)
  File "/path/lib/python-3/lib/python3.5/site-packages/pandas/io/pytables.py", line 1315, in _write_to_group
    s.write(obj=value, append=append, complib=complib, **kwargs)
  File "/path/lib/python-3/lib/python3.5/site-packages/pandas/io/pytables.py", line 4263, in write
    obj=obj, data_columns=data_columns, **kwargs)
  File "/path/lib/python-3/lib/python3.5/site-packages/pandas/io/pytables.py", line 3853, in write
    **kwargs)
  File "/path/lib/python-3/lib/python3.5/site-packages/pandas/io/pytables.py", line 3535, in create_axes
    self.validate_min_itemsize(min_itemsize)
  File "/path/lib/python-3/lib/python3.5/site-packages/pandas/io/pytables.py", line 3174, in validate_min_itemsize
    "data_column" % k)
ValueError: min_itemsize has the key [values_block_0] which is not an axis or data_column
hpaulj
ShanZhengYang

2 Answers


UPDATE:

You have misspelled the data_columns parameter: you wrote data_column, but it should be data_columns. As a result you didn't have any data columns in your HDF store, and pandas packed the remaining columns into internal values_block_X columns:

In [70]: store = pd.HDFStore(r'D:\temp\.data\my_test.h5')

misspelled parameters will be ignored:

In [71]: store.append('no_idx_wrong_dc', df, data_column=df.columns, index=False)

In [72]: store.get_storer('no_idx_wrong_dc').table
Out[72]:
/no_idx_wrong_dc/table (Table(10,)) ''
  description := {
  "index": Int64Col(shape=(), dflt=0, pos=0),
  "values_block_0": Float64Col(shape=(1,), dflt=0.0, pos=1),
  "values_block_1": Int64Col(shape=(1,), dflt=0, pos=2),
  "values_block_2": StringCol(itemsize=30, shape=(1,), dflt=b'', pos=3)}
  byteorder := 'little'
  chunkshape := (1213,)

is the same as the following:

In [73]: store.append('no_idx_no_dc', df, index=False)

In [74]: store.get_storer('no_idx_no_dc').table
Out[74]:
/no_idx_no_dc/table (Table(10,)) ''
  description := {
  "index": Int64Col(shape=(), dflt=0, pos=0),
  "values_block_0": Float64Col(shape=(1,), dflt=0.0, pos=1),
  "values_block_1": Int64Col(shape=(1,), dflt=0, pos=2),
  "values_block_2": StringCol(itemsize=30, shape=(1,), dflt=b'', pos=3)}
  byteorder := 'little'
  chunkshape := (1213,)

let's spell it correctly:

In [75]: store.append('no_idx_dc', df, data_columns=df.columns, index=False)

In [76]: store.get_storer('no_idx_dc').table
Out[76]:
/no_idx_dc/table (Table(10,)) ''
  description := {
  "index": Int64Col(shape=(), dflt=0, pos=0),
  "value": Float64Col(shape=(), dflt=0.0, pos=1),
  "count": Int64Col(shape=(), dflt=0, pos=2),
  "s": StringCol(itemsize=30, shape=(), dflt=b'', pos=3)}
  byteorder := 'little'
  chunkshape := (1213,)
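With data_columns spelled correctly, min_itemsize can then target your real column names instead of the internal values_block_X columns. A minimal sketch of the fix for the original question (the file name and the column name "s" are illustrative):

```python
import pandas as pd

df = pd.DataFrame({"num": [1, 2], "s": ["short", "a bit longer"]})

store = pd.HDFStore("sized.h5", mode="w")
# data_columns (plural!) makes 's' a real, queryable column, so
# min_itemsize can reserve room for longer strings on later appends
store.append("key", df, data_columns=True, index=False,
             min_itemsize={"s": 250})

# inspect the reserved width of the string column
desc = store.get_storer("key").table.description
itemsize = desc._v_colobjects["s"].itemsize
store.close()
```

Because "s" is now a data column, min_itemsize={"s": 250} is accepted and the column is created 250 bytes wide, even though no string in the first batch is that long.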

OLD Answer:

AFAIK you can effectively set the min_itemsize parameter on the first append only.

Demo:

In [33]: df
Out[33]:
   num                 s
0   11  aaaaaaaaaaaaaaaa
1   12    bbbbbbbbbbbbbb
2   13     ccccccccccccc
3   14       ddddddddddd

In [34]: store = pd.HDFStore(r'D:\temp\.data\my_test.h5')

In [35]: store.append('test_1', df, data_columns=True)

In [36]: store.get_storer('test_1').table.description
Out[36]:
{
  "index": Int64Col(shape=(), dflt=0, pos=0),
  "num": Int64Col(shape=(), dflt=0, pos=1),
  "s": StringCol(itemsize=16, shape=(), dflt=b'', pos=2)}

In [37]: df.loc[4] = [15, 'X'*200]

In [38]: df
Out[38]:
   num                                                  s
0   11                                   aaaaaaaaaaaaaaaa
1   12                                     bbbbbbbbbbbbbb
2   13                                      ccccccccccccc
3   14                                        ddddddddddd
4   15  XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX...

In [39]: store.append('test_1', df, data_columns=True)
...
skipped
...
ValueError: Trying to store a string with len [200] in [s] column but
this column has a limit of [16]!
Consider using min_itemsize to preset the sizes on these columns    

now using min_itemsize, but still appending to the existing store object:

In [40]: store.append('test_1', df, data_columns=True, min_itemsize={'s':250})
...
skipped
...
ValueError: Trying to store a string with len [250] in [s] column but
this column has a limit of [16]!
Consider using min_itemsize to preset the sizes on these columns

The following works if we create a new object in our store:

In [41]: store.append('test_2', df, data_columns=True, min_itemsize={'s':250})

Check column sizes:

In [42]: store.get_storer('test_2').table.description
Out[42]:
{
  "index": Int64Col(shape=(), dflt=0, pos=0),
  "num": Int64Col(shape=(), dflt=0, pos=1),
  "s": StringCol(itemsize=250, shape=(), dflt=b'', pos=2)}
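Since the column width is fixed when the table is created, a chunked-append loop only needs min_itemsize to take effect on the first append; passing it on subsequent appends is harmless but has no effect. A sketch under that assumption (the chunk list stands in for whatever source, e.g. a CSV reader, produces your DataFrames):

```python
import pandas as pd

store = pd.HDFStore("chunks.h5", mode="w")
key = "one_key"

chunks = [
    pd.DataFrame({"s": ["aa", "bbb"]}),
    pd.DataFrame({"s": ["x" * 200]}),  # longer than anything in the first chunk
]

for chunk in chunks:
    # min_itemsize matters only on the first append, which creates the
    # table; 250 must be >= the longest string in ANY future chunk
    store.append(key, chunk, data_columns=True, index=False,
                 min_itemsize={"s": 250})

n_rows = store.get_storer(key).nrows
store.close()
```

The 200-character string in the second chunk fits because the first append already reserved 250 bytes for the column.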
MaxU - stand with Ukraine
  • Thanks. I am still slightly confused how to implement this solution when iterating through multiple dataframes and appending? `for chunk in pd.csv_reader(): store.append(key, chunk, data_columns)` or `for i in range: df=pd.Dataframe(); store.append(key, chunk, data_columns)` like answer here: http://stackoverflow.com/questions/39925077/how-do-i-combine-multiple-pandas-dataframes-into-an-hdf5-object-under-one-key-gr It appears you run the script. If there's an error, `store.append` on a new key. – ShanZhengYang Oct 10 '16 at 13:44
  • @ShanZhengYang, you either need to know the maximum length of the `values_block_0` column or use a value which will for sure be able to hold the max. length, for example: `min_itemsize={"values_block_0":1000}` – MaxU - stand with Ukraine Oct 10 '16 at 14:53
  • The problem with this approach (i.e. using `min_itemsize={"values_block_0":1000}`) is I get this error: `ValueError: min_itemsize has the key [values_block_0] which is not an axis or data_column`. Only after the first error is raised `ValueError: Trying to store a string with len [200] in [values_block_0] column but this column has a limit of [16]!` does it appears that `values_block_0` is recognized as a column – ShanZhengYang Oct 10 '16 at 15:49
  • should I be using a different value than `value_block_0`? – ShanZhengYang Oct 10 '16 at 15:52
  • @ShanZhengYang, can you post a code producing `ValueError: min_itemsize has the key [values_block_0] which is not an axis or data_column`? – MaxU - stand with Ukraine Oct 10 '16 at 15:59
  • Edite the above. I could send you the actual script, but that's basically it. – ShanZhengYang Oct 10 '16 at 16:12
  • @ShanZhengYang, it means that your `dicts_into_df` DF doesn't have `values_block_0` at the moment you call `store.append(...)` – MaxU - stand with Ukraine Oct 10 '16 at 16:14
  • I see. So, this should be named first? – ShanZhengYang Oct 10 '16 at 16:16
  • @ShanZhengYang, your DF should have a column that you are specifying in the `min_itemsize` parameter – MaxU - stand with Ukraine Oct 10 '16 at 16:17
  • It isn't a column that I have provided. I have edited the above. Could you explain to me what `values_block_0` and `values_block_1` are? These values are not columns within the dataframe `df` – ShanZhengYang Oct 10 '16 at 16:35
  • This works. Silly mistake of mine. Thank you! However, there is still an issue that if you define the `min_itemsize` for each column, the could be a `ValueError` on the first append, like `ValueError: min_itemsize has the key [col_20] which is not an axis or data_column`. I'm guessing this is before this column is read? It must be considered a column as it is included in `data_columns` – ShanZhengYang Oct 10 '16 at 17:24
  • @ShanZhengYang, i think i've answered your original question. Please open a new one as your question and my answer are already pretty overloaded... – MaxU - stand with Ukraine Oct 10 '16 at 17:27
  • Yes, you answered this. Thanks! I was clarifying a pandas functional issue. Thank you! I appreciate the help – ShanZhengYang Oct 10 '16 at 17:29

I started to get this error at around the same time as updating pandas from 0.18.1 to 0.22.0 (although this could be unrelated).

I fixed the error in the existing HDF5 file by manually reading the dataframe in, then writing a new HDF5 file with a larger min_itemsize for the column mentioned in the error:

import pandas as pd

filename_hdf5 = r"C:\test.h5"  # raw string so the backslash isn't treated as an escape
df = pd.read_hdf(filename_hdf5, 'table_name')
hdf = pd.HDFStore(filename_hdf5)
hdf.put('table_name', df, format='table', data_columns=True, min_itemsize={'ColumnNameMentionedInError': 10})
hdf.close()

I then updated the existing code to set min_itemsize on key creation.


Extra for Experts

The error occurs because one is trying to append more rows to an existing HDF5 table whose fixed column width is too narrow for the new data. The fixed column width was originally set based on the longest string in the column when the table was first written.

Methinks that pandas should handle this error transparently, rather than leaving what is effectively a timebomb for all future appends. This issue could take weeks or even years to surface.
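If all the data is visible up front, one way to defuse that timebomb is to size min_itemsize from the measured maximum string length plus headroom, rather than guessing. A minimal sketch (file name, column name, and the 2x headroom factor are all illustrative choices):

```python
import pandas as pd

df = pd.DataFrame({"s": ["short", "x" * 150]})

# measure the longest string in the column, then reserve extra headroom
# so future appends with somewhat longer strings still fit
max_len = int(df["s"].str.len().max())
store = pd.HDFStore("sized2.h5", mode="w")
store.append("table_name", df, data_columns=True,
             min_itemsize={"s": max_len * 2})
store.close()
```

The headroom factor is a trade-off: a wider column wastes disk space, but too narrow a column reproduces the ValueError on a later append.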

Contango