Select rows from DataFrame based on condition

Question

I have a DataFrame with multiple columns; each column contains NaN values at different index positions. The index is a DatetimeIndex with 5-minute frequency.

Take the first column: select the rows up to the first NaN row, or up to the first run of NaNs if several NaN values are adjacent.

I'd like to forecast the NaN value(s) from the sequence of values that precedes them.

Next iteration: select the rows from the beginning up to the next NaN value or run of adjacent NaNs, including those NaN rows, then forecast the NaN value(s).

This goes on until the column no longer contains NaNs.

import pandas as pd
import numpy as np

column = ['values']
data = [1,2,3,4, np.nan, 6, 7, 8, np.nan, np.nan, np.nan, 12, 13, 14, np.nan, np.nan, 17, 18]
DateList_ = pd.date_range(start='2018-10-29 10:00:00', end='2018-10-29 11:25:00', freq='5min')

df_ = pd.DataFrame(data=data, columns=column, index=DateList_)

First iteration: values = [1,2,3,4], nan_periods=1 -> Forecast [5] from values

Second iteration: values = [1,2,3,4,5,6,7,8], nan_periods=3 -> Forecast [9,10,11] from values

Third iteration: values = [1,2,3,4,5,6,7,8,9,10,11,12,13,14], nan_periods=2 -> Forecast [15,16] from values

Iteration stops, no more NaN values.

The forecasting is done with Prophet. I'd like some help with the selection part: iteratively selecting the rows that contain values, together with the NaN(s) that follow them.

Unfortunately, I can't use interpolation! The real sequences I have are far more complex and not LINEAR like this dummy example! This example is only used to demonstrate the selection!

`Please address the solution I asked`... What you've asked is literally a linear interpolation. If you want some other method of interpolation or application you'll have to be more specific... – yatu May 15 '19 at 10:20
I edited the post to indicate that no interpolation could be used. – hk_03 May 15 '19 at 10:21
Man! The sequence is not linear! It's just a dummy example. The real question is how to select the values and the NaN ranges iteratively! – hk_03 May 15 '19 at 10:23
If you want to find where a NaN or a run of consecutive NaNs starts, you can use something like `df_['values'].isna().astype(int).diff().gt(0)`. You'll have to do the same to get the indices where these runs end. You'll surely find something similar here on SO – yatu May 15 '19 at 10:31
No... this is just finding the first index where a group of consecutive NaNs occurs. – yatu May 15 '19 at 10:37
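
To make the suggestion in the comments concrete, here is a minimal sketch that marks both where each run of consecutive NaNs starts and where it ends. It uses the dummy data from the question; the names `run_starts`, `run_ends` and `nan_runs` are illustrative and not part of the original posts.

```python
import numpy as np
import pandas as pd

values = [1, 2, 3, 4, np.nan, 6, 7, 8, np.nan, np.nan, np.nan,
          12, 13, 14, np.nan, np.nan, 17, 18]
index = pd.date_range(start="2018-10-29 10:00:00", periods=len(values), freq="5min")
s = pd.Series(values, index=index, name="values")

is_nan = s.isna()
# A run starts at a NaN whose previous row is not NaN (or that is the first row).
run_starts = is_nan & ~is_nan.shift(1, fill_value=False)
# A run ends at a NaN whose next row is not NaN (or that is the last row).
run_ends = is_nan & ~is_nan.shift(-1, fill_value=False)

# Pair up the first and last timestamp of every NaN run.
nan_runs = list(zip(s.index[run_starts.to_numpy()], s.index[run_ends.to_numpy()]))
for first, last in nan_runs:
    print(first, "->", last)
```

Unlike the `diff().gt(0)` variant, comparing against a shifted mask with `fill_value=False` should also catch a run that begins on the very first row.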

jezrael · Answer 1 · 2019-05-15T10:44:53.933

The idea is to label each run of consecutive NaNs in a Series `g`, then loop over `groupby(g)` to get the variables for the forecast. To select by position, use `Index.get_loc` to get the position of the first index value of each group and the position of the column name; then assign the forecast with `loc` to replace the NaNs:

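s = df_['values'].isna()
# label every run of equal values in the mask, keep the labels of NaN rows only
g = s.ne(s.shift()).cumsum()[s]
for i, x in df_.groupby(g):
    # x holds the rows of one run of consecutive NaNs
    nan_periods = len(x)
    # all rows before the run, including forecasts filled in earlier iterations
    values = df_.iloc[:df_.index.get_loc(x.index[0]), df_.columns.get_loc('values')]
    print(nan_periods)
    print(values)
    # sample data, replace with the real forecast
    Forecast = 10
    df_.loc[x.index, 'values'] = Forecast

print(df_)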
                     values
2018-10-29 10:00:00     1.0
2018-10-29 10:05:00     2.0
2018-10-29 10:10:00     3.0
2018-10-29 10:15:00     4.0
2018-10-29 10:20:00    10.0
2018-10-29 10:25:00     6.0
2018-10-29 10:30:00     7.0
2018-10-29 10:35:00     8.0
2018-10-29 10:40:00    10.0
2018-10-29 10:45:00    10.0
2018-10-29 10:50:00    10.0
2018-10-29 10:55:00    12.0
2018-10-29 11:00:00    13.0
2018-10-29 11:05:00    14.0
2018-10-29 11:10:00    10.0
2018-10-29 11:15:00    10.0
2018-10-29 11:20:00    17.0
2018-10-29 11:25:00    18.0
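
The same selection logic can also be wrapped in a small generator so that the forecasting step stays separate from the selection step. This is only a sketch under assumptions: `iter_nan_gaps` and `placeholder_forecast` are hypothetical names, and the placeholder simply repeats the last known value where the real code would call Prophet.

```python
import numpy as np
import pandas as pd


def iter_nan_gaps(series):
    """Yield (history, gap_index) for each run of consecutive NaNs, in order.

    `history` is read from `series` at each step, so values the caller wrote
    into earlier gaps are part of it.
    """
    is_nan = series.isna()
    # Label every run of equal mask values, keep the labels of NaN rows only.
    run_id = is_nan.ne(is_nan.shift()).cumsum()[is_nan]
    # Collect the gap indexes up front so filling the series later
    # cannot interfere with the iteration.
    gaps = [grp.index for _, grp in run_id.groupby(run_id)]
    for gap_index in gaps:
        start = series.index.get_loc(gap_index[0])
        yield series.iloc[:start], gap_index


def placeholder_forecast(history, periods):
    # Hypothetical stand-in for the real model: repeat the last known value.
    return np.repeat(history.iloc[-1], periods)


data = [1, 2, 3, 4, np.nan, 6, 7, 8, np.nan, np.nan, np.nan,
        12, 13, 14, np.nan, np.nan, 17, 18]
index = pd.date_range(start="2018-10-29 10:00:00", periods=len(data), freq="5min")
s = pd.Series(data, index=index, name="values", dtype=float)

steps = []
for history, gap_index in iter_nan_gaps(s):
    steps.append((len(history), len(gap_index)))
    s.loc[gap_index] = placeholder_forecast(history, len(gap_index))

print(steps)  # [(4, 1), (8, 3), (14, 2)]
```

The history lengths 4, 8 and 14 match the three iterations described in the question. Note that the placeholder would fail on a column that starts with NaN, because the history is then empty; the real forecast would need its own handling for that case.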

edited May 15 '19 at 10:44

answered May 15 '19 at 10:38

jezrael

Thanks a lot! Trying to understand your code now :D – hk_03 May 15 '19 at 10:42
@hk_03 - OK, if anything needs more explanation, let me know. – jezrael May 15 '19 at 10:45
Is it possible to wrap this with tqdm? – hk_03 May 15 '19 at 10:49
@hk_03 - I have no experience with it, so no idea. – jezrael May 15 '19 at 10:52
As far as I understand, almost anything with for can be wrapped in tqdm. – Quang Hoang May 15 '19 at 11:49
@QuangHoang - do you have some idea, how it should be implemented? – jezrael May 15 '19 at 11:51
I guess: `for i, x in tqdm.tqdm(df_.groupby(g)):`. If it works it works, else no :-) – Quang Hoang May 15 '19 at 11:53
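
A minimal sketch of the tqdm idea from the comments, assuming the third-party `tqdm` package is installed (a no-op fallback is defined so the snippet still runs without it). Passing `total` explicitly is an assumption made here so that the bar knows the number of NaN runs; the constant forecast of 10 is the same placeholder used in the answer.

```python
import numpy as np
import pandas as pd

try:
    from tqdm import tqdm
except ImportError:
    # Fallback so the sketch still runs without tqdm: no progress bar is shown.
    def tqdm(iterable, **kwargs):
        return iterable

data = [1, 2, 3, 4, np.nan, 6, 7, 8, np.nan, np.nan, np.nan,
        12, 13, 14, np.nan, np.nan, 17, 18]
index = pd.date_range(start="2018-10-29 10:00:00", periods=len(data), freq="5min")
df_ = pd.DataFrame(data=data, columns=["values"], index=index)

s = df_["values"].isna()
g = s.ne(s.shift()).cumsum()[s]

filled = 0
# One progress step per run of consecutive NaNs.
for _, x in tqdm(df_.groupby(g), total=g.nunique(), desc="NaN gaps"):
    df_.loc[x.index, "values"] = 10  # placeholder forecast, as in the answer
    filled += len(x)

print(filled)
```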
