pandas limited resample / windowed replacing of multiple rows of values

Question

I am working with weather data for PV modules. The irradiance dataset I've been given (a regular time series of 1-second data) shows an issue that occurs often in this field: occasionally, a zero value shows up when it shouldn't (during the daytime), e.g. due to an instrument or data-writing error.

My solution that worked in the past is as below:

```
df['PoA_corr'] = df['PoA'].replace(0,np.nan).resample('1s').mean().interpolate(method='linear',axis=0).ffill().bfill()
```

where `PoA` is the original column (with issues) and `PoA_corr` is my attempt at correcting the errors.

However, as can be seen from the image below, not all of the erroneous points have been corrected appropriately: the point where PoA == 0 is preceded and followed by 1-4 points that are also incorrect. In other words, the data shows a "V" shape with one point equal to zero, and the whole "V" needs to be replaced by a line interpolated between the points just before and just after it.
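To make the symptom concrete, here is a minimal sketch on made-up numbers (the timestamps and values are assumptions, not taken from the real dataset). It applies the same chain as above to a small "V" and shows that only the zero itself is replaced, while the sagging flanks survive.

```python
import numpy as np
import pandas as pd

# Hypothetical 1-second series with a "V": the flanks sag toward a single zero.
idx = pd.date_range("2021-04-09 12:00:00", periods=9, freq="s")
df = pd.DataFrame(
    {"PoA": [400.0, 400.0, 300.0, 150.0, 0.0, 150.0, 300.0, 400.0, 400.0]},
    index=idx,
)

# Same chain as in the question: only exact zeros are turned into NaN.
df["PoA_corr"] = (
    df["PoA"]
    .replace(0, np.nan)
    .resample("1s")
    .mean()
    .interpolate(method="linear", axis=0)
    .ffill()
    .bfill()
)

print(df)
```

The zero is filled from its immediate neighbours (150 on both sides), so the corrected column still dips from 400 to 150 instead of staying near 400.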

I have a few ideas in mind, but am stumped as to which is best, and which would be most pythonic (or able to be made so).

1. Get a list of indices where PoA == 0, look 3 seconds (rows) above, then replace 6-8 s (= 6-8 rows) of data. I can find the list of points during the day using `between_time` and then find the point above using a timedelta, but I don't know how to replace/overwrite the subsequent 6-8 rows (or interpolate between points "X-4" and "X+4", where X is the location where PoA == 0). The df is large (2.3 GB), so I'm loath to use a for loop on this. At the moment, I find the datetimes where PoA == 0 during the day with:
```
df.between_time('09:00','16:00').loc[df['PoA']==0]['datetime']
```
2. Do some form of moving window on the data, so that if any value within the window == 0, interpolate between the first and last value of the window. Here I'm stumped as to how that could be done.

Is the solution to be found within pandas, or are numpy or pure python advisable?
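Both ideas above can be sketched in pandas without a Python loop. The version below is only one possible approach under stated assumptions: the helper name `interpolate_around_zeros` and its parameters `half_width`, `start` and `end` are hypothetical, the series is assumed to have a regular 1-second `DatetimeIndex`, and every daytime zero is assumed to be an error. It flags the zeros, widens the flag to the neighbouring rows with a centered rolling maximum, blanks the flagged rows, and interpolates across the gap.

```python
import numpy as np
import pandas as pd


def interpolate_around_zeros(s, half_width=4, start="09:00", end="16:00"):
    """Blank each daytime zero plus `half_width` rows either side, then interpolate.

    Hypothetical helper; assumes `s` has a regular 1-second DatetimeIndex.
    """
    # Restrict to daytime so legitimate night-time zeros are left alone.
    day = np.zeros(len(s), dtype=bool)
    day[s.index.indexer_between_time(start, end)] = True
    is_zero = pd.Series(s.eq(0).to_numpy() & day, index=s.index)

    # A centered rolling max spreads each flagged row to its neighbours.
    window = 2 * half_width + 1
    bad = (
        is_zero.astype(float)
        .rolling(window, center=True, min_periods=1)
        .max()
        .astype(bool)
    )

    # Flagged rows become NaN; linear interpolation then bridges each gap.
    return s.mask(bad).interpolate(method="linear").ffill().bfill()


# Made-up example: a "V" around a single zero at midday.
idx = pd.date_range("2021-04-09 12:00:00", periods=9, freq="s")
poa = pd.Series([400.0, 400.0, 300.0, 150.0, 0.0, 150.0, 300.0, 400.0, 400.0], index=idx)
fixed = interpolate_around_zeros(poa, half_width=2)
print(fixed)
```

With `half_width=2` the five rows of the "V" are replaced by a straight line between the last good value before it and the first good value after it. Every step is vectorised, which matters for a frame of this size, but the right `half_width` depends on how wide the "V" shapes really are in the data.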

`df.interpolate()` will automatically interpolate any number of `nan`'s in a row. I would just set a threshold value `delta` such that if `(x1 - x2) > delta` then replace that value with `nan`. That should replace all V-shapes with `nan`'s given an appropriate threshold choice. Then just use `df.interpolate()` and voila! — Frodnar, Apr 09 '21 at 22:17
This is indeed the first approach I considered, yet irradiance (sunlight) sees some rapid changes, where it is not clear from e.g. using `df['PoA'].diff()` whether a change is due to natural reasons (clouds moving), or data errors. The main difference I see so far is that data errors here have at least one point that is equal to zero. E.g.: with cloud: from 400 -> 200. Data error: from 400 -> 0, or from 200 -> 0. What if there is only one zero (or `nan`) in the "V", instead of multiple such values? — Monte Cristo, Apr 10 '21 at 10:05
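The ambiguity raised in this exchange can be illustrated on invented numbers. In the sketch below, `delta` is a hypothetical threshold and the series is made up: it contains one cloud-like step (400 to 200) and one "V" that reaches zero through gradual flanks.

```python
import pandas as pd

# Made-up series: cloud step at position 2, "V" with a zero at position 6.
s = pd.Series([400.0, 400.0, 200.0, 200.0, 200.0, 100.0, 0.0, 100.0, 200.0, 200.0])

delta = 150.0  # hypothetical threshold on the row-to-row change

# Threshold on the absolute difference, as suggested in the first comment.
flag_by_diff = s.diff().abs().gt(delta)

# Anchor on exact zeros, as suggested in the reply.
flag_by_zero = s.eq(0)

print(list(s.index[flag_by_diff]))
print(list(s.index[flag_by_zero]))
```

With these particular values the difference threshold flags only the cloud step and misses the "V" entirely, because no single step inside the "V" exceeds `delta`, whereas the zero test picks out the bottom of the "V" and ignores the cloud. Real data may behave differently, so this only shows why a zero-anchored mask can be the safer starting point.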
