I am working with weather data for PV modules. The irradiance dataset (regular timeseries, 1 second data) I've been given shows an issue that occurs often in this field: occasionally, a zero value shows up when it shouldn't (daytime), e.g. due to an instrument or data writing error.
The solution that has worked for me in the past is:
df['PoA_corr'] = df['PoA'].replace(0,np.nan).resample('1s').mean().interpolate(method='linear',axis=0).ffill().bfill()
where PoA is the original column (with issues) and PoA_corr is my attempt at correcting the errors.
However, as can be seen from the image below, not all of the erroneous points have been corrected appropriately. The issue is that the point where PoA == 0 is preceded and followed by 1-4 points that are also incorrect. In other words, there is a "V" shape in the data with one point equal to 0, and the whole "V" needs to be replaced by a line interpolated between the points just before and just after it.
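To make the problem concrete, here is a minimal synthetic reproduction. The values and timestamps are made up for illustration (they are not from my real dataset), and the index is assumed to be a DatetimeIndex at 1 s resolution:

```python
import numpy as np
import pandas as pd

# Made-up 1 s series: steady 800 W/m2 with a "V"-shaped dip down to 0.
idx = pd.date_range("2021-06-01 12:00:00", periods=11, freq="s")
poa = [800, 800, 800, 600, 300, 0, 300, 600, 800, 800, 800]
df = pd.DataFrame({"PoA": poa}, index=idx, dtype=float)

# Same correction as above: only the exact zero is turned into NaN
# and interpolated, so the flanks of the "V" are left untouched.
df["PoA_corr"] = (
    df["PoA"]
    .replace(0, np.nan)
    .resample("1s")
    .mean()
    .interpolate(method="linear")
    .ffill()
    .bfill()
)
print(df)
```

In this toy case the zero is filled from its two neighbours, but the neighbouring 300 and 600 values, which are also wrong, survive in PoA_corr.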
I have a few ideas in mind, but am stumped as to which is best, and which would be most pythonic (or able to be made so).
1. Get a list of indices where PoA == 0, look 3 seconds (rows) above, then replace 6-8 s (= 6-8 rows) of data. I manage to find the list of points during the day using between_time and then find the point above using a timedelta, yet I don't know how to replace/overwrite the subsequent 6-8 rows (or interpolate between points "X-4" and "X+4", where X is the location where PoA == 0). The df is large (2.3 GB), so I'm loath to use a for loop on this. At the moment, my list of datetimes where PoA == 0 during the day is found as: df.between_time('09:00','16:00').loc[df['PoA']==0]['datetime']
2. Do some form of moving window on the data, so that if any value within the window == 0, the data is interpolated between the first and last value of the window. Here I'm stumped as to how that could be done.
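For what it's worth, here is a rough sketch of how the two ideas might combine without a for loop. Flag the daytime zeros, widen each flag by a few rows on either side with a centered rolling maximum, blank those rows out, and interpolate across the gap. The function name mask_and_interpolate and the half_width parameter are hypothetical names of mine. The default of 4 rows and the 09:00-16:00 daytime window are assumptions taken from the description above, and the sketch assumes the series has a DatetimeIndex. I have not tried it on the full 2.3 GB frame.

```python
import numpy as np
import pandas as pd

def mask_and_interpolate(s, half_width=4, start="09:00", end="16:00"):
    """Blank out daytime zeros plus `half_width` rows either side, then interpolate.

    Hypothetical helper; assumes `s` is a regular 1 s Series with a DatetimeIndex.
    """
    # Boolean array marking rows whose time of day lies in [start, end].
    daytime = np.zeros(len(s), dtype=bool)
    daytime[s.index.indexer_between_time(start, end)] = True

    # Zeros at night are genuine, so only daytime zeros are flagged.
    is_bad_zero = pd.Series((s.to_numpy() == 0) & daytime, index=s.index)

    # A centered rolling max spreads each flag to its neighbours:
    # a row is masked if any row within +/- half_width rows is a bad zero.
    window = 2 * half_width + 1
    widened = (
        is_bad_zero.astype(float)
        .rolling(window, center=True, min_periods=1)
        .max()
        .astype(bool)
    )

    # Replace the whole "V" with NaN and draw a straight line across it.
    return s.mask(widened).interpolate(method="linear").ffill().bfill()

# Made-up example: a "V" around a single zero in otherwise steady data.
idx = pd.date_range("2021-06-01 12:00:00", periods=13, freq="s")
poa = pd.Series(
    [800, 800, 800, 800, 600, 300, 0, 300, 600, 800, 800, 800, 800],
    index=idx,
    dtype=float,
)
fixed = mask_and_interpolate(poa, half_width=3)
print(fixed)
```

Everything here is vectorised, so there is no Python-level loop over rows. The trade-off is that a fixed half_width also overwrites good points when the "V" is narrower than the window.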
Is the solution to be found within pandas, or are numpy or pure python advisable?