6

Suppose I was trying to organize sales data for a membership business.

I only have the start and end dates. Ideally sales between the start and end dates appear as 1, instead of missing.

I can't get the 'date' column to be filled with in-between dates. That is: I want a continuous set of months instead of gaps. Plus I need to fill missing data in columns with ffill.

I have tried different ways such as stack/unstack and reindex but different errors occur. I'm guessing there's a clean way to do this. What's the best practice to do this?

Suppose the multiindexed data structure:

                 variable     sales
vendor date                 
a      2014-01-01  start date 1
       2014-03-01    end date 1
b      2014-03-01  start date 1
       2014-07-01    end date 1

And the desired result

                   variable   sales
vendor date                 
a      2014-01-01  start date 1
       2014-02-01  NaN        1
       2014-03-01    end date 1
b      2014-03-01  start date 1
       2014-04-01  NaN        1
       2014-05-01  NaN        1
       2014-06-01  NaN        1 
       2014-07-01    end date 1
Gregor Thomas
  • 136,190
  • 20
  • 167
  • 294
LPG
  • 822
  • 1
  • 11
  • 20
  • Why the R and matlab tags? I'm taking them off, if you have a reason go ahead and put them back but state the reason! – Gregor Thomas Dec 02 '14 at 18:47
  • possible duplicate of [Filling in date gaps in MultiIndex Pandas Dataframe](http://stackoverflow.com/questions/17287933/filling-in-date-gaps-in-multiindex-pandas-dataframe) – Kirubaharan J Dec 02 '14 at 19:32
  • @KirubaharanJ That solution doesn't generate a full list of dates. – LPG Dec 02 '14 at 19:42

2 Answers2

5

you can do:

>>> f = lambda df: df.resample(rule='M', how='first')
>>> df.reset_index(level=0).groupby('vendor').apply(f).drop('vendor', axis=1)
                     variable  sales
vendor date                         
a      2014-01-31  start date      1
       2014-02-28         NaN    NaN
       2014-03-31    end date      1
b      2014-03-31  start date      1
       2014-04-30         NaN    NaN
       2014-05-31         NaN    NaN
       2014-06-30         NaN    NaN
       2014-07-31    end date      1

and then just .fillna on sales column if needed.

behzad.nouri
  • 74,723
  • 18
  • 126
  • 124
  • Thanks, that seems to work! The important step happens here: groupby('vendor').apply(f) . Can you just give me more clarity on why we need a lambda here? I'm not that experienced with Pandas... – LPG Dec 03 '14 at 15:48
  • @LPG `groupby(...).apply(f)` applies the given function to every group, see [here](http://pandas.pydata.org/pandas-docs/stable/groupby.html#flexible-apply) for more examples – behzad.nouri Dec 04 '14 at 12:38
0

I have a solution, but it's not really simple:

so, here's your DataFrame:

>>> df
                   sales date variable
vendor date                           
a      2014-01-01      1    start date
       2014-01-03      1      end date
b      2014-01-03      1    start date
       2014-01-07      1      end date

first, I want to create data for new MultiIndex:

>>> df2 = df.set_index('date variable', append=True).reset_index(level='date')['date']
>>> df2
vendor  date variable
a         start date    2014-01-01
          end date      2014-01-03
b         start date    2014-01-03
          end date      2014-01-07
>>> df2 = df2.unstack()
>>> df2
date variable   end date   start date
vendor                               
a             2014-01-03   2014-01-01
b             2014-01-07   2014-01-03

now, create tuples for new MultiIndex:

>>> tuples = [(x[0], d) for x in df3.iterrows() for d in pd.date_range(x[1]['start date'], x[1]['end date'])]
>>> tuples
[('a', '2014-01-01'), ..., ('b', '2014-01-07)]

and create MultiIndex and reindex():

>>> mi = pd.MultiIndex.from_tuples(tuples,names=df.index.names)
>>> df.reindex(mi)
                   sales date variable
vendor date                           
a      2014-01-01      1    start date
       2014-01-02    NaN           NaN
       2014-01-03      1      end date
b      2014-01-03      1    start date
       2014-01-04    NaN           NaN
       2014-01-05    NaN           NaN
       2014-01-06    NaN           NaN
       2014-01-07      1      end date
Roman Pekar
  • 107,110
  • 28
  • 195
  • 197
  • Thanks for this, you do get the result I imagined. The issue I foresee is that on the real dataset there will be 5 levels in the multiindex. So I was hoping there was a way to use reindex() just on the date column. Any ideas? – LPG Dec 02 '14 at 20:44
  • @LPG not sure if it's possbile without resetting index on the original DataFrame – Roman Pekar Dec 02 '14 at 21:35
  • Maybe there is a way to use the pivoted/unstacked structure? Having the dates as column names, 1 as values, and add one column per month? – LPG Dec 02 '14 at 21:50