I have a dataset analogous to the one below where for a website I have the number of views every month for two years (2001-2002). However, due to the way the data was gathered, I only have information for a website if it had > 0 views. So, I am trying to fill in the number of views for months where that is not the case: i.e., cases where the website was online but had no views.
Unfortunately, I have no information for when the website was first published, so I assume that it was introduced the first time there are non-zero values for a month. I also assume the website was taken down if there are consecutive months with np.nan
values at the end of 2002.
So, currently, the Views
column has np.nan
values for both months where views are zero, and the website was simply not online.
I want to make sure that months with zero views have 0 in the Views
column, such that the below data frame,
Website ,Month,Year ,Views
1,January,2001,
1,February,2001,
1,March,2001,3.0
1,April,2001,4.0
1,May,2001,23.0
1,June,2001,
1,July,2001,5.0
1,August,2001,4.0
1,September,2001,3.0
1,October,2001,3.0
1,November,2001,3.0
1,December,2001,35.0
1,January,2002,6.0
1,February,2002,
1,March,2002,3.0
1,April,2002,
1,May,2002,
1,June,2002,3.0
1,July,2002,3.0
1,August,2002,2.0
1,September,2002,
1,October,2002,
1,November,2002,
1,December,2002,
2,January,2001,3.0
2,February,2001,1.0
2,March,2001,2.0
2,April,2001,2.0
2,May,2001,22.0
2,June,2001,
2,July,2001,4.0
2,August,2001,3.0
2,September,2001,3.0
2,October,2001,4.0
2,November,2001,
2,December,2001,1.0
2,January,2002,
2,February,2002,4.0
2,March,2002,2.0
2,April,2002,5.0
2,May,2002,2.0
2,June,2002,
2,July,2002,2.0
2,August,2002,3.0
2,September,2002,
2,October,2002,
2,November,2002,2.0
2,December,2002,5.0
looks like this:
Website ,Month,Year ,Views
1,January,2001,
1,February,2001,
1,March,2001,3.0
1,April,2001,4.0
1,May,2001,23.0
1,June,2001,0.0
1,July,2001,5.0
1,August,2001,4.0
1,September,2001,3.0
1,October,2001,3.0
1,November,2001,3.0
1,December,2001,35.0
1,January,2002,6.0
1,February,2002,0.0
1,March,2002,3.0
1,April,2002,0.0
1,May,2002,0.0
1,June,2002,3.0
1,July,2002,3.0
1,August,2002,2.0
1,September,2002,
1,October,2002,
1,November,2002,
1,December,2002,
2,January,2001,3.0
2,February,2001,1.0
2,March,2001,2.0
2,April,2001,2.0
2,May,2001,22.0
2,June,2001,0.0
2,July,2001,4.0
2,August,2001,3.0
2,September,2001,3.0
2,October,2001,4.0
2,November,2001,0.0
2,December,2001,1.0
2,January,2002,0.0
2,February,2002,4.0
2,March,2002,2.0
2,April,2002,5.0
2,May,2002,2.0
2,June,2002,0.0
2,July,2002,2.0
2,August,2002,3.0
2,September,2002,0.0
2,October,2002,0.0
2,November,2002,2.0
2,December,2002,5.0
In other words, if all preceding months for that website show np.nan
values, and the current value is np.nan
, it should remain that way. Similarly, if all following months show np.nan
, the column should remain np.nan
as well. However, if at least one preceding month is not np.nan
the value should change to 0, etc.
The tricky part is that my dataset has about 4,000,000 rows, and I need a fairly efficient way to do this.
Does anyone have any suggestions?