3

How can I achieve such a per-country imputation for each indicator in pandas?

I want to impute the missing values per group

  • no-A-state should get np.min per indicatorKPI
  • no-ISO-state should get the np.mean per indicatorKPI
  • for states with missing values, I want to impute with the per indicatorKPI mean. Here, this would mean to impute the missing values for Serbia

    mydf = pd.DataFrame({'Country':['no-A-state','no-ISO-state','germany','serbia', 'austria', 'germany','serbia', 'austria',], 'indicatorKPI':[np.nan,np.nan,'SP.DYN.LE00.IN','NY.GDP.MKTP.CD','NY.GDP.MKTP.CD', 'SP.DYN.LE00.IN','NY.GDP.MKTP.CD', 'SP.DYN.LE00.IN'], 'value':[np.nan,np.nan,0.9,np.nan,0.7, 0.2, 0.3, 0.6]}) enter image description here

edit

The desired output should be similar to

mydf = pd.DataFrame({'Country':['no-A-state','no-ISO-state', 'no-A-state','no-ISO-state',
                                'germany','serbia','serbia', 'austria', 
                                'germany','serbia', 'austria',],
                   'indicatorKPI':['SP.DYN.LE00.IN','NY.GDP.MKTP.CD', 'SP.DYN.LE00.IN',
                                   'SP.DYN.LE00.IN','NY.GDP.MKTP.CD','SP.DYN.LE00.IN','NY.GDP.MKTP.CD','NY.GDP.MKTP.CD', 'SP.DYN.LE00.IN','NY.GDP.MKTP.CD', 'SP.DYN.LE00.IN'],
                     'value':['MIN of all for this indicator', 'MEAN of all for this indicator','MIN of all for this indicator','MEAN of all for this indicator', 0.9,'MEAN of all for SP.DYN.LE00.IN indicator',0.7, 'MEAN of all for NY.GDP.MKTP.CD indicator',0.2, 0.3, 0.6]
                   })

enter image description here

Georg Heiler
  • 16,916
  • 36
  • 162
  • 292
  • 1
    Regarding the desired output, are these calculations calculated separate from each other are they dependent? For instance if you set the value for the first requirement to be the min then this affects the mean value for the column after adding the min value – EdChum Sep 21 '16 at 12:45
  • 1
    `mydf.loc[mydf['Country'] == 'no-A-state', 'value'] = mydf['value'].min() mydf.loc[mydf['Country'] == 'no-ISO-state', 'value'] = mydf['value'].mean() mydf.loc[mydf['value'].isnull(), 'value'] = mydf['indicatorKPI'].map(mydf.groupby('indicatorKPI')['value'].mean())` works but may not be what you desired depending on my previous comment – EdChum Sep 21 '16 at 12:45
  • 1
    @EdChum indeed. I would simply suggest to fill with MIN first and calculate MEAN afterwards – Georg Heiler Sep 21 '16 at 12:50
  • 1
    @EdChum: `mydf.loc[mydf['Country'] == 'no-A-state' ...` would you use a loop over all group_by? – Georg Heiler Sep 21 '16 at 12:51
  • 1
    the `map` takes care of that, did my code snippet work then? – EdChum Sep 21 '16 at 12:53

1 Answers1

3

Based on your new example df the following works for me:

In [185]:
mydf.loc[mydf['Country'] == 'no-A-state', 'value'] = mydf['value'].min()
mydf.loc[mydf['Country'] == 'no-ISO-state', 'value'] = mydf['value'].mean()
mydf.loc[mydf['value'].isnull(), 'value'] = mydf['indicatorKPI'].map(mydf.groupby('indicatorKPI')['value'].mean())
mydf

Out[185]:
         Country    indicatorKPI     value
0     no-A-state  SP.DYN.LE00.IN  0.200000
1   no-ISO-state  NY.GDP.MKTP.CD  0.442857
2     no-A-state  SP.DYN.LE00.IN  0.200000
3   no-ISO-state  SP.DYN.LE00.IN  0.442857
4        germany  NY.GDP.MKTP.CD  0.900000
5         serbia  SP.DYN.LE00.IN  0.328571
6         serbia  NY.GDP.MKTP.CD  0.700000
7        austria  NY.GDP.MKTP.CD  0.585714
8        germany  SP.DYN.LE00.IN  0.200000
9         serbia  NY.GDP.MKTP.CD  0.300000
10       austria  SP.DYN.LE00.IN  0.600000

Basically what this does is to fill the missing values for each condition, so we set the min for the 'no-A-state' countries, then mean for 'no-ISO-state' countries. We then groupby on 'indicatorKPI' and calc the mean for each group and assign again to the null value rows, the respective countries' mean using map which performs a lookup

Here are the steps broken down:

In [187]:
mydf.groupby('indicatorKPI')['value'].mean()

Out[187]:

indicatorKPI
NY.GDP.MKTP.CD    0.633333
SP.DYN.LE00.IN    0.400000
Name: value, dtype: float64

In [188]:
mydf['indicatorKPI'].map(mydf.groupby('indicatorKPI')['value'].mean())

Out[188]:
0     0.400000
1     0.633333
2     0.400000
3     0.400000
4     0.633333
5     0.400000
6     0.633333
7     0.633333
8     0.400000
9     0.633333
10    0.400000
Name: indicatorKPI, dtype: float64
EdChum
  • 376,765
  • 198
  • 813
  • 562
  • 1
    Howver it is a bit strange that `mydf['indicatorKPI'].map(mydf.groupby('indicatorKPI')['value'].mean())` returns an error on my real, bigger dataset: wrong number of items passed: #7_numberOfGroubs, placement implies 767_numberOfItems – Georg Heiler Sep 21 '16 at 14:38
  • 1
    can't comment without being there, in principle this should work, try restarting your kernel in case you have some dodgy temporaries – EdChum Sep 21 '16 at 15:03
  • 1
    Though, I still face a problem with the 2 special cases of no-A and no-ISO state if I want to use `.map(mydf.groupby('indicatorKPI')['value'].mean())` no value is assigned -> even though it returns a big list of values. 2) unlike to your example code for these special cases I do not get them filled up with indicatorKPI and the MIN/MEAN value but rather only the imputed value --> I only get a single result and not per the number of indicators – Georg Heiler Sep 21 '16 at 15:19
  • 1
    Sorry I don't understand how you're intending to fill the KPI columns, this wasn't clear in your question I don't see how you would fill these with arbitrary values – EdChum Sep 21 '16 at 15:25
  • Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/123874/discussion-between-georg-heiler-and-edchum). – Georg Heiler Sep 21 '16 at 15:27