4

Can you store data with pandas' HDFStore and then open the file / perform I/O on it using PyTables? The question comes up because I am currently storing data like this:

store = pd.HDFStore('Filename', mode='a')
store.append('key', data)

However, as I understand it, pandas doesn't really support updating records in place. I have a use case where I have to update 5% of the data daily. Would pd.io.pytables work? If so, I found no documentation on it. PyTables has a lot of documentation, but I am not sure whether I can open / update the file with PyTables when I didn't use PyTables to save the file initially.

piRSquared
CodeGeek123
  • Why is it tagged with `r`? – akrun Jan 11 '17 at 15:55
  • When you say "pandas doesn't support updating records", are you referring to updating in place? Couldn't you just load the table from HDFStore into a df, update the 5%, and write the whole table back (change the mode to 'w' instead of append)? – flyingmeatball Jan 11 '17 at 16:16
  • The table is too big to be loaded into memory entirely. Is there some way around this? – CodeGeek123 Jan 11 '17 at 16:33
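One way around the memory limit (a sketch, not from the thread: the file names, key, and the tiny chunksize are made up for illustration) is to stream the table through select(..., chunksize=...) and rewrite it chunk by chunk into a new store, so no chunk ever has to hold the whole table:

```python
import os
import tempfile

import numpy as np
import pandas as pd

tmp = tempfile.mkdtemp()
src_path = os.path.join(tmp, "big.h5")
dst_path = os.path.join(tmp, "big_new.h5")

# Stand-in for the real (too-big-for-memory) table.
df = pd.DataFrame({"a": np.arange(10.0)})
with pd.HDFStore(src_path, mode="w") as store:
    store.append("data", df, format="t", data_columns=True)

# Rewrite the table chunk by chunk; only `chunksize` rows are in memory at once.
with pd.HDFStore(src_path) as src, pd.HDFStore(dst_path, mode="w") as dst:
    for chunk in src.select("data", chunksize=4):
        chunk.loc[chunk["a"] > 7, "a"] *= 100  # the daily update
        dst.append("data", chunk, format="t", data_columns=True)
```

Atomically replacing the old file with the new one afterwards avoids ever holding two full copies in memory, at the cost of rewriting the whole table once.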

2 Answers


Here are the docs I think you're after:

http://pandas.pydata.org/pandas-docs/version/0.19.0/api.html?highlight=pytables

See this thread as well:

Update pandas DataFrame in stored in a Pytable with another pandas DataFrame

It looks like you can load the 5% of records into memory, remove them from the store, then append the updated ones back:

store.remove(key, where=...)
store.append(key, updated_df)

You can also do this outside of pandas; see the PyTables tutorial on removal:

http://www.pytables.org/usersguide/tutorials.html
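As a sketch of that PyTables route (the file path, key, and column names here are invented for illustration): a table written by pandas with data_columns=True can be reopened with tables.open_file and the matching rows updated in place via Row.update(), without loading the whole table:

```python
import os
import tempfile

import pandas as pd
import tables  # PyTables

path = os.path.join(tempfile.mkdtemp(), "demo.h5")

# Write a table-format store with pandas; data_columns=True makes the
# columns queryable by name from PyTables as well.
df = pd.DataFrame({"a": [0.1, 0.6, 0.9], "b": [1.0, 2.0, 3.0]})
with pd.HDFStore(path, mode="w") as store:
    store.append("test", df, format="t", data_columns=True)

# Reopen the same file with PyTables and update matching rows in place.
h5 = tables.open_file(path, mode="a")
table = h5.root.test.table          # pandas stores the data under <key>/table
for row in table.where("a > 0.5"):
    row["b"] = row["b"] * 100       # the in-place update
    row.update()
table.flush()
h5.close()

# The change is visible to pandas readers of the same file.
print(pd.read_hdf(path, "test"))
```

This avoids the remove-then-append round trip entirely when only values change (not which rows exist), which matches the "update 5% daily" use case.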

flyingmeatball

Here is a demonstration of @flyingmeatball's answer:

Let's generate a test DF:

In [56]: df = pd.DataFrame(np.random.rand(15, 3), columns=list('abc'))

In [57]: df
Out[57]:
           a         b         c
0   0.022079  0.901965  0.282529
1   0.596452  0.096204  0.197186
2   0.034127  0.992500  0.523114
3   0.659184  0.447355  0.246932
4   0.441517  0.853434  0.119602
5   0.779707  0.429574  0.744452
6   0.105255  0.934440  0.545421
7   0.216278  0.217386  0.282171
8   0.690729  0.052097  0.146705
9   0.828667  0.439608  0.091007
10  0.988435  0.326589  0.536904
11  0.687250  0.661912  0.318209
12  0.829129  0.758737  0.519068
13  0.500462  0.723528  0.026962
14  0.464162  0.364536  0.843899

and save it to an HDFStore (NOTE: don't forget to use data_columns=True, or data_columns=[list_of_columns_to_index], so that every column we want to use in the where clause is indexed):

In [58]: store = pd.HDFStore(r'd:/temp/test_removal.h5')

In [59]: store.append('test', df, format='t', data_columns=True)

In [60]: store.close()

Solution:

In [61]: store = pd.HDFStore(r'd:/temp/test_removal.h5')

The .remove() method returns the number of removed rows:

In [62]: store.remove('test', where="a > 0.5")
Out[62]: 9

Let's append the changed (multiplied by 100) rows:

In [63]: store.append('test', df.loc[df.a > 0.5] * 100, format='t', data_columns=True)

Test:

In [64]: store.select('test')
Out[64]:
            a          b          c
0    0.022079   0.901965   0.282529
2    0.034127   0.992500   0.523114
4    0.441517   0.853434   0.119602
6    0.105255   0.934440   0.545421
7    0.216278   0.217386   0.282171
14   0.464162   0.364536   0.843899
1   59.645151   9.620415  19.718557
3   65.918421  44.735482  24.693160
5   77.970749  42.957446  74.445185
8   69.072948   5.209725  14.670545
9   82.866731  43.960848   9.100682
10  98.843540  32.658931  53.690360
11  68.725002  66.191215  31.820942
12  82.912937  75.873689  51.906795
13  50.046189  72.352794   2.696243

finalize:

In [65]: store.close()
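Note in the output above that the rewritten rows land at the end of the table, so the row order changes. If the original order matters, a sort_index() after select restores it; a minimal sketch (file path and values made up for illustration):

```python
import os
import tempfile

import pandas as pd

path = os.path.join(tempfile.mkdtemp(), "order.h5")
df = pd.DataFrame({"a": [9.0, 1.0, 7.0]})

with pd.HDFStore(path, mode="w") as store:
    store.append("test", df, format="t", data_columns=True)
    # remove + append moves the updated rows to the end of the table...
    store.remove("test", where="a > 5")
    store.append("test", df.loc[df.a > 5] * 100, format="t", data_columns=True)
    # ...so sort on the preserved index to restore the original order.
    out = store.select("test").sort_index()

print(out)
```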
MaxU - stand with Ukraine