4

Can you store data with pandas' HDFStore and then open the file / perform I/O on it using PyTables? The question comes up because I am currently storing data like this:

store = pd.HDFStore('Filename', mode='a')
store.append('key', data)

However, as I understand it, pandas doesn't really support updating records in place. I have a use case where I have to update 5% of the data daily. Would pd.io.pytables work? If so, I found no documentation on it. PyTables has a lot of documentation, but I am not sure whether I can open / update the file with PyTables when I didn't use PyTables to save the file initially.

piRSquared
CodeGeek123
  • Why is it tagged with `r`? – akrun Jan 11 '17 at 15:55
  • When you say "pandas doesn't support updating records", are you referring to updating in place? Couldn't you just load the table from HDFStore into a df, update the 5%, and write the whole table back (change the mode to 'w' instead of append)? – flyingmeatball Jan 11 '17 at 16:16
  • The table is too big to be loaded into memory entirely. Is there some way around this? – CodeGeek123 Jan 11 '17 at 16:33
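One way around the memory limit (a sketch, not from the thread: the file names, key, and the tiny chunksize are made up for illustration) is to stream the table through select(..., chunksize=...) and rewrite it chunk by chunk into a new store, so no chunk ever has to hold the whole table:

```python
import os
import tempfile

import numpy as np
import pandas as pd

tmp = tempfile.mkdtemp()
src_path = os.path.join(tmp, "big.h5")
dst_path = os.path.join(tmp, "big_new.h5")

# Stand-in for the real (too-big-for-memory) table.
df = pd.DataFrame({"a": np.arange(10.0)})
with pd.HDFStore(src_path, mode="w") as store:
    store.append("data", df, format="t", data_columns=True)

# Rewrite the table chunk by chunk; only `chunksize` rows are in memory at once.
with pd.HDFStore(src_path) as src, pd.HDFStore(dst_path, mode="w") as dst:
    for chunk in src.select("data", chunksize=4):
        chunk.loc[chunk["a"] > 7, "a"] *= 100  # the daily update
        dst.append("data", chunk, format="t", data_columns=True)
```

Atomically replacing the old file with the new one afterwards avoids ever holding two full copies in memory, at the cost of rewriting the whole table once.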

2 Answers


Here are the docs I think you're after:

http://pandas.pydata.org/pandas-docs/version/0.19.0/api.html?highlight=pytables

See this thread as well:

Update pandas DataFrame in stored in a Pytable with another pandas DataFrame

It looks like you can load the 5% of records into memory, remove them from the store, then append the updated ones back:

store.remove(key, where=...)
store.append(key, updated_df)

You can also do this outside of pandas; see the PyTables tutorial on removal:

http://www.pytables.org/usersguide/tutorials.html
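As a sketch of that PyTables route (the file path, key, and column names here are invented for illustration): a table written by pandas with data_columns=True can be reopened with tables.open_file and the matching rows updated in place via Row.update(), without loading the whole table:

```python
import os
import tempfile

import pandas as pd
import tables  # PyTables

path = os.path.join(tempfile.mkdtemp(), "demo.h5")

# Write a table-format store with pandas; data_columns=True makes the
# columns queryable by name from PyTables as well.
df = pd.DataFrame({"a": [0.1, 0.6, 0.9], "b": [1.0, 2.0, 3.0]})
with pd.HDFStore(path, mode="w") as store:
    store.append("test", df, format="t", data_columns=True)

# Reopen the same file with PyTables and update matching rows in place.
h5 = tables.open_file(path, mode="a")
table = h5.root.test.table          # pandas stores the data under <key>/table
for row in table.where("a > 0.5"):
    row["b"] = row["b"] * 100       # the in-place update
    row.update()
table.flush()
h5.close()

# The change is visible to pandas readers of the same file.
print(pd.read_hdf(path, "test"))
```

This avoids the remove-then-append round trip entirely when only values change (not which rows exist), which matches the "update 5% daily" use case.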

flyingmeatball

Here is a demonstration of @flyingmeatball's answer:

Let's generate a test DF:

In [56]: df = pd.DataFrame(np.random.rand(15, 3), columns=list('abc'))

In [57]: df
Out[57]:
           a         b         c
0   0.022079  0.901965  0.282529
1   0.596452  0.096204  0.197186
2   0.034127  0.992500  0.523114
3   0.659184  0.447355  0.246932
4   0.441517  0.853434  0.119602
5   0.779707  0.429574  0.744452
6   0.105255  0.934440  0.545421
7   0.216278  0.217386  0.282171
8   0.690729  0.052097  0.146705
9   0.828667  0.439608  0.091007
10  0.988435  0.326589  0.536904
11  0.687250  0.661912  0.318209
12  0.829129  0.758737  0.519068
13  0.500462  0.723528  0.026962
14  0.464162  0.364536  0.843899

and save it to an HDFStore (NOTE: don't forget to use data_columns=True, or data_columns=[list_of_columns_to_index], so that every column we want to use in the where clause is indexed):

In [58]: store = pd.HDFStore(r'd:/temp/test_removal.h5')

In [59]: store.append('test', df, format='t', data_columns=True)

In [60]: store.close()

Solution:

In [61]: store = pd.HDFStore(r'd:/temp/test_removal.h5')

The .remove() method returns the number of removed rows:

In [62]: store.remove('test', where="a > 0.5")
Out[62]: 9

Let's append the changed (multiplied by 100) rows:

In [63]: store.append('test', df.loc[df.a > 0.5] * 100, format='t', data_columns=True)

Test:

In [64]: store.select('test')
Out[64]:
            a          b          c
0    0.022079   0.901965   0.282529
2    0.034127   0.992500   0.523114
4    0.441517   0.853434   0.119602
6    0.105255   0.934440   0.545421
7    0.216278   0.217386   0.282171
14   0.464162   0.364536   0.843899
1   59.645151   9.620415  19.718557
3   65.918421  44.735482  24.693160
5   77.970749  42.957446  74.445185
8   69.072948   5.209725  14.670545
9   82.866731  43.960848   9.100682
10  98.843540  32.658931  53.690360
11  68.725002  66.191215  31.820942
12  82.912937  75.873689  51.906795
13  50.046189  72.352794   2.696243

finalize:

In [65]: store.close()
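Note in the output above that the rewritten rows land at the end of the table, so the row order changes. If the original order matters, a sort_index() after select restores it; a minimal sketch (file path and values made up for illustration):

```python
import os
import tempfile

import pandas as pd

path = os.path.join(tempfile.mkdtemp(), "order.h5")
df = pd.DataFrame({"a": [9.0, 1.0, 7.0]})

with pd.HDFStore(path, mode="w") as store:
    store.append("test", df, format="t", data_columns=True)
    # remove + append moves the updated rows to the end of the table...
    store.remove("test", where="a > 5")
    store.append("test", df.loc[df.a > 5] * 100, format="t", data_columns=True)
    # ...so sort on the preserved index to restore the original order.
    out = store.select("test").sort_index()

print(out)
```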
MaxU - stand with Ukraine