1

Say I have a pd DataFrame that looks like this:

                            price     volume     cat_count 
zipcode      date
91111.0      01/01/2018     10        5          NaN
             02/10/2018     NaN       9          NaN
94312.0      04/04/2018     7         4          6
             02/10/2018     NaN       3          4
96666.0      05/05/2018     NaN       3          14
             02/10/2018     NaN       NaN        8
             07/08/2018     NaN       0          NaN
98432.0      06/08/2018     4         NaN        NaN

And say I have a dictionary whose keys are zipcodes and whose values are lists of nearby zipcodes (within x kilometers of they key zipcode), sorted by how close they are to the key zipcode with the closer ones appearing first. This dictionary looks like:

nearby_zips = {
     91111.0 : [94312.0],
     94312.0 : [91111.0, 96666.0],
     96666.0 : [94312.0],
     98432.0 : []
}

How can I efficiently interpolate the data so if for any column, all the values are NaNs in a zipcode index, find the nearest zipcode that has non NaN values for said column, and use these values to fill in the for the zipcode that has all NaN values for the column.

For reference output on the above example DataFrame would look like:

                            price     volume     cat_count 
zipcode      date
91111.0      01/01/2018     10        5          NaN
             02/10/2018     NaN       9          4
             04/04/2018     NaN       NaN        6
94312.0      04/04/2018     7         4          6
             02/10/2018     NaN       3          4
96666.0      05/05/2018     NaN       3          14
             02/10/2018     NaN       NaN        8
             07/08/2018     NaN       0          NaN
             04/04/2018     7         NaN        NaN
98432.0      06/08/2018     4         NaN        NaN

Notice the data in the zipcode index 91111.0 and 96666.0 and how they changed.

Mark Keane
  • 984
  • 2
  • 11
  • 26
  • Where does the `4` come from in the second row under `cat_count `? – yatu Dec 29 '18 at 20:49
  • the 91111.0 zipcode was close to 94312.0 according to our dict so because 02/10/2018 is in both, that value gets replaced in 91111.0 with the value from 94312.0 – Mark Keane Dec 29 '18 at 20:53
  • Oh I see, it is because its the same date right? `02/10/2018` – yatu Dec 29 '18 at 20:56
  • And why is `df.loc['96666.0','price']` not interpolated with `94312.0 04/04/2018 7 ',` given that they are all `NaNs` ? – yatu Dec 29 '18 at 21:12
  • yes, because it is the same date. and you are totally right, that was my mistake, df.loc['96666.0','price'] should be interpolated - i updated the question to reflect this. – Mark Keane Dec 29 '18 at 21:37
  • why does df.loc['96666.0','07/08/2018']['cat_count'] become 'NaN'? – xudesheng Dec 29 '18 at 21:44
  • thanks yatu! and @Windchill df.loc['96666.0','07/08/2018']['cat_count'] starts as NaN, it does not become NaN. it stays NaN because cat_count has non NaN values in it for the 96666.0 index and thus no interpolation occurs. – Mark Keane Dec 29 '18 at 21:47
  • @MarkKeane, my bad, I didn't notice df.loc['96666.0','04/04/2018'] is a new line. – xudesheng Dec 29 '18 at 21:49

0 Answers0