I am trying to create a function that updates a pandas DataFrame stored that I have stored in a PyTable with new data from a pandas DataFrame. I want to check if some data is missing in the PyTable for specific DatetimeIndexes (value is NaN or a new Timestamp is available), replace this with new values from a given pandas DataFrame and append this to the Pytable. Basically, just update a Pytable. I can get the combined DataFrame using the combine_first method in Pandas. Below the Pytable is created with dummy data:
import pandas as pd import numpy as np import datetime as dt index = pd.DatetimeIndex(start = dt.datetime(2001,1,1,0,0), periods = 20000,freq='10T') data_in_pytable = pd.DataFrame(index=index,data=np.random.randn(20000,2),columns=['value_1','value_2']) data.to_hdf(r'C:\pytable.h5','test',mode='r+',append=True,complevel=9,complib='zlib')
So the pytable is created. Assuming I have another dataFrame with which I want to update the Pytable with:
new_index = pd.DatetimeIndex(start = dt.datetime(2001,5,1,0,0), periods = 10000,freq='10T') data_to_update=pd.DataFrame(index=new_index,data=np.random.randn(10000,2),columns=['value_1','value_2']) store=pd.HDFStore(r'C:\pytable.h5',mode='r+',complevel=9,complib='zlib') store.append('test',store.select('test').combine_first(data_to_update)) store.close()
The problem is that the PyTable keeps the original values, does not update the existing ones. I now have duplicate entries (by index) because the original values are not overwritten.
Summary: How can I update a PyTable with another DataFrame?
Thanks, Elv