
I have a pandas dataframe that looks like this:

    year    week  val1   val2
0   2017   45     10.1   20.2
0   2017   48     10.3   20.3
0   2017   49     10.4   20.4
0   2017   52     10.3   20.5
0   2018    1     10.1   20.2
0   2018    2     10.3   20.3
0   2018    5     10.4   20.4
0   2018    9     10.3   20.5
....

Notice that the weeks are not contiguous. What is the best way to fill in the missing rows, with val1 and val2 set to NaN? E.g. so that my years run from 2017 to 2018 and my weeks are 45-52 and 1-9.
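
For reference, the sample above can be rebuilt with something like this (using a default integer index rather than the repeated 0 shown):

import pandas as pd

df = pd.DataFrame({
    "year": [2017, 2017, 2017, 2017, 2018, 2018, 2018, 2018],
    "week": [45, 48, 49, 52, 1, 2, 5, 9],
    "val1": [10.1, 10.3, 10.4, 10.3, 10.1, 10.3, 10.4, 10.3],
    "val2": [20.2, 20.3, 20.4, 20.5, 20.2, 20.3, 20.4, 20.5],
})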

Thanks so much.

BaJoe
3 Answers


You can groupby year and then reindex each group on the union of its existing weeks and the full week range:

import numpy as np

(df.set_index("week")
   .groupby("year")
   # reindex each year on the full, inclusive range of its weeks
   .apply(lambda x: x.reindex(x.index.union(np.arange(x.index.min(), x.index.max() + 1))))
   .drop(columns="year")  # the per-group year column is NaN on new rows; reset_index restores it from the group key
   .reset_index()
   .rename(columns={"level_1": "week"}))

    year  week  val1  val2
0   2017    45  10.1  20.2
1   2017    46   NaN   NaN
2   2017    47   NaN   NaN
3   2017    48  10.3  20.3
4   2017    49  10.4  20.4
5   2017    50   NaN   NaN
6   2017    51   NaN   NaN
7   2017    52  10.3  20.5
8   2018     1  10.1  20.2
9   2018     2  10.3  20.3
10  2018     3   NaN   NaN
11  2018     4   NaN   NaN
12  2018     5  10.4  20.4
13  2018     6   NaN   NaN
14  2018     7   NaN   NaN
15  2018     8   NaN   NaN
16  2018     9  10.3  20.5
andrew_reece

I'd create a reference dataframe of all year/week pairs and merge:

# build every (year, week) pair from each year's min week to its max week
ref = pd.DataFrame(
    [[y, w] for y, s in df.groupby('year').week for w in range(s.min(), s.max() + 1)],
    columns=['year', 'week']
)

# left merge: weeks missing from df get NaN in val1/val2
ref.merge(df, 'left')

    year  week  val1  val2
0   2017    45  10.1  20.2
1   2017    46   NaN   NaN
2   2017    47   NaN   NaN
3   2017    48  10.3  20.3
4   2017    49  10.4  20.4
5   2017    50   NaN   NaN
6   2017    51   NaN   NaN
7   2017    52  10.3  20.5
8   2018     1  10.1  20.2
9   2018     2  10.3  20.3
10  2018     3   NaN   NaN
11  2018     4   NaN   NaN
12  2018     5  10.4  20.4
13  2018     6   NaN   NaN
14  2018     7   NaN   NaN
15  2018     8   NaN   NaN
16  2018     9  10.3  20.5
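
Here `'left'` is the `how` argument, and the merge keys default to the columns shared by both frames (`year` and `week` in this case); spelling both out explicitly gives the same result:

ref.merge(df, on=['year', 'week'], how='left')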
piRSquared

I'd make use of pandas' Time Series / Date functionality: combine the year and week columns into a datetime index, then resample your dataframe, with something like:

# build a date for each row from year + week number + "3" (Wednesday), parsed with %Y %W %w
df.index = pd.to_datetime(
    df.year.map(str) + " " + df.week.map(str) + " 3",
    format="%Y %W %w"
)
df = df.resample("W").mean()   # weekly resample inserts the missing weeks as NaN rows
df.year = df.index.year        # year/week are NaN on the inserted rows, so rebuild them
df.week = df.index.week

Note that your index is overwritten.
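
If you'd rather not keep the datetime index afterwards (a small addition on my part, not required by the approach), you can drop it once year and week have been rebuilt:

df = df.reset_index(drop=True)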

0asa
  • I like this best, but on a larger dataset, why does it give me only 51 weeks in 2016? – BaJoe Aug 08 '18 at 10:50
  • That's weird. I tried with the following data: `df = pd.DataFrame({"year": [2016, 2016, 2017], "week": [1,4,5], "val1": [1,2,3]})` and I got 52 weeks in 2016. Can you give an example that produces only 51 weeks? – 0asa Aug 09 '18 at 18:58