Filling in missing value based on values in both preceding and succeeding rows

Question

I have a dataset analogous to the one below where for a website I have the number of views every month for two years (2001-2002). However, due to the way the data was gathered, I only have information for a website if it had > 0 views. So, I am trying to fill in the number of views for months where that is not the case: i.e., cases where the website was online but had no views.

Unfortunately, I have no information for when the website was first published, so I assume that it was introduced the first time there are non-zero values for a month. I also assume the website was taken down if there are consecutive months with np.nan values at the end of 2002.

So, currently, the Views column has np.nan values for both months where views are zero, and the website was simply not online.

I want to make sure that months with zero views have 0 in the Views column, such that the below data frame,

Website ,Month,Year ,Views
1,January,2001,
1,February,2001,
1,March,2001,3.0
1,April,2001,4.0
1,May,2001,23.0
1,June,2001,
1,July,2001,5.0
1,August,2001,4.0
1,September,2001,3.0
1,October,2001,3.0
1,November,2001,3.0
1,December,2001,35.0
1,January,2002,6.0
1,February,2002,
1,March,2002,3.0
1,April,2002,
1,May,2002,
1,June,2002,3.0
1,July,2002,3.0
1,August,2002,2.0
1,September,2002,
1,October,2002,
1,November,2002,
1,December,2002,
2,January,2001,3.0
2,February,2001,1.0
2,March,2001,2.0
2,April,2001,2.0
2,May,2001,22.0
2,June,2001,
2,July,2001,4.0
2,August,2001,3.0
2,September,2001,3.0
2,October,2001,4.0
2,November,2001,
2,December,2001,1.0
2,January,2002,
2,February,2002,4.0
2,March,2002,2.0
2,April,2002,5.0
2,May,2002,2.0
2,June,2002,
2,July,2002,2.0
2,August,2002,3.0
2,September,2002,
2,October,2002,
2,November,2002,2.0
2,December,2002,5.0

looks like this:

Website ,Month,Year ,Views
1,January,2001,
1,February,2001,
1,March,2001,3.0
1,April,2001,4.0
1,May,2001,23.0
1,June,2001,0.0
1,July,2001,5.0
1,August,2001,4.0
1,September,2001,3.0
1,October,2001,3.0
1,November,2001,3.0
1,December,2001,35.0
1,January,2002,6.0
1,February,2002,0.0
1,March,2002,3.0
1,April,2002,0.0
1,May,2002,0.0
1,June,2002,3.0
1,July,2002,3.0
1,August,2002,2.0
1,September,2002,
1,October,2002,
1,November,2002,
1,December,2002,
2,January,2001,3.0
2,February,2001,1.0
2,March,2001,2.0
2,April,2001,2.0
2,May,2001,22.0
2,June,2001,0.0
2,July,2001,4.0
2,August,2001,3.0
2,September,2001,3.0
2,October,2001,4.0
2,November,2001,0.0
2,December,2001,1.0
2,January,2002,0.0
2,February,2002,4.0
2,March,2002,2.0
2,April,2002,5.0
2,May,2002,2.0
2,June,2002,0.0
2,July,2002,2.0
2,August,2002,3.0
2,September,2002,0.0
2,October,2002,0.0
2,November,2002,2.0
2,December,2002,5.0

In other words, if all preceding months for that website show np.nan values, and the current value is np.nan, it should remain that way. Similarly, if all following months show np.nan, the column should remain np.nan as well. However, if at least one preceding month is not np.nan the value should change to 0, etc.

The tricky part is that my dataset has about 4,000,000 rows, and I need a fairly efficient way to do this.

Does anyone have any suggestions?

score 0 · Answer 1 · answered Dec 20 '19 at 19:42

0

Here's my approach

# s counts the non-null views so far
s = df['Views'].notnull().groupby(df['Website']).cumsum()

# fill the null only where s > 0
df['Views'] = np.where(df['Views'].isna() & s.gt(0), 0, df['Views'])

# equivalent
# df.loc[df['View'].isna() & s.gt(0), 'Views'] = 0

answered Dec 20 '19 at 19:42

Quang Hoang

146,074
10
56
74

Thanks -- this worked! I simply had to perform the same step with cumsum() from later to earlier dates to account for the website being taken down. – ucmom Dec 21 '19 at 00:28

score 0 · Answer 2 · answered Dec 21 '19 at 00:33

I followed Quang Hoang's response and used the below code, which worked perfectly:

#Same as Quang Hoang's answer:
s = df['Views'].notnull().groupby(df['Website']).cumsum() 

#Count the non-null views so far but starting with the last observations
b = df['Views'].notnull()[::-1].groupby(df['Website']).cumsum()

# fill the null only where s > 0 and b > 0 
df['Views'] = np.where(df['Views'].isna() & (s.gt(0) & b.gt(0)), 0, df['Views'])

Filling in missing value based on values in both preceding and succeeding rows

2 Answers2