
I have an incomplete dataframe, incomplete_df, as below. I want to impute the missing amounts with the average amount of the corresponding id. If the average for that specific id is itself NaN (see id=4), I want to use the overall average.

Below are the example data and my highly inefficient solution:

import pandas as pd
import numpy as np
incomplete_df = pd.DataFrame({'id': [1, 2, 3, 2, 2, 3, 1, 1, 1, 2, 4],
                              'type': ['one', 'one', 'two', 'three', 'two', 'three', 'one', 'two', 'one', 'three', 'one'],
                              'amount': [345, 928, np.nan, 645, 113, 942, np.nan, 539, np.nan, 814, np.nan]},
                             columns=['id', 'type', 'amount'])

# Forrest Gump Solution
means = incomplete_df.groupby('id')['amount'].mean().dropna()  # per-id averages; ids with no known amount drop out
for idx in incomplete_df.index[incomplete_df['amount'].isna()]:  # loop through all rows with amount = NaN
    cur_id = incomplete_df.loc[idx, 'id']
    if cur_id in means.index:
        incomplete_df.loc[idx, 'amount'] = means[cur_id]  # average amount of that specific id
    else:
        incomplete_df.loc[idx, 'amount'] = means.mean()  # average of the per-id averages

What is the fastest and the most pythonic/pandonic way to achieve this?

Zhubarb
  • in 0.13 you can do this: http://pandas.pydata.org/pandas-docs/dev/missing_data.html#interpolation and see missing values section: http://pandas.pydata.org/pandas-docs/dev/missing_data.html#cleaning-filling-missing-data – Jeff Jan 10 '14 at 17:24
  • @Jeff, thank you. What I have is not a timeseries though. So, statistically I am looking for imputation rather than (inter/extra)polation. How can I handle the situation where the average of a given id is itself `NaN`? – Zhubarb Jan 10 '14 at 17:26
  • see @DSM's solution below, as it's probably what you want. But FYI, interpolation CAN work on DataFrames; a timeseries is irrelevant. There are many options. – Jeff Jan 10 '14 at 17:38
  • @Jeff your section link is now https://pandas-docs.github.io/pandas-docs-travis/missing_data.html#cleaning-filling-missing-data . Also it would be helpful to add the OP's point to the docs: pandas imputation is not just for timeseries, and the terms 'backward'/'forward' should be avoided (just say 'missing') for non-sequential, non-timeseries data. It would also be useful to say pandas only provides single imputation, not multiple imputation; see third-party packages like `fancyimpute` etc. – smci Apr 12 '18 at 21:29
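
A quick aside on the interpolation Jeff mentions: it is position-based, filling a NaN from its neighbouring values rather than from the per-id grouping the question asks for. A minimal sketch on a toy Series (not the question's data):

import pandas as pd
import numpy as np

s = pd.Series([345.0, 928.0, np.nan, 645.0])
print(s.interpolate())  # linear by default: the NaN becomes 786.5, halfway between 928 and 645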

1 Answer


Disclaimer: I'm not really interested in the fastest solution but the most pandorable.

Here, I think that would be something like (using df for the question's incomplete_df):

>>> df["amount"].fillna(df.groupby("id")["amount"].transform("mean"), inplace=True)
>>> df["amount"].fillna(df["amount"].mean(), inplace=True)

which produces

>>> df
    id   type  amount
0    1    one   345.0
1    2    one   928.0
2    3    two   942.0
3    2  three   645.0
4    2    two   113.0
5    3  three   942.0
6    1    one   442.0
7    1    two   539.0
8    1    one   442.0
9    2  three   814.0
10   4    one   615.2

[11 rows x 3 columns]

There are lots of obvious tweaks depending upon exactly how you want the chained imputation process to go.
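
One such tweak: on newer pandas, where fillna(..., inplace=True) through a column selection no longer updates the parent frame under copy-on-write, the same two-step fill can be written as plain assignments. A sketch:

# Same logic without inplace; the overall mean in step two is taken after
# the per-id fill, matching the 615.2 shown for id 4 above.
filled = df["amount"].fillna(df.groupby("id")["amount"].transform("mean"))
df["amount"] = filled.fillna(filled.mean())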

DSM
  • Thank you DSM, if I had a placeholder, e.g. 0, in the place of the NaN's, would the fastest way be replacing back all 0's as NaN's and then follow your solution? ( I know putting 0's in place of NaN's is a stupid idea - alas what I have to work with is what I've got.) – Zhubarb Jan 10 '14 at 17:37
  • That's probably what I'd do. The use of NaN to represent missing data runs pretty deep in pandas, and so the simplest native way to do something usually requires getting your data aligned to that. I don't know about fastest; you can use timeit to compare. Swapping 0s for NaNs will be both linear and vectorized anyway, so it wouldn't add much to the runtime. – DSM Jan 10 '14 at 17:43
  • just to confirm, those are two alternatives, and not both necessary to complete the single operation? – 3pitt Jan 19 '18 at 17:13
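
To illustrate the placeholder swap discussed in these comments, a minimal sketch (assuming 0 only ever marks a missing amount and is never a real value; uses the pd/np imports from the question):

df["amount"] = df["amount"].replace(0, np.nan)  # turn placeholders back into NaN
df["amount"] = df["amount"].fillna(df.groupby("id")["amount"].transform("mean"))
df["amount"] = df["amount"].fillna(df["amount"].mean())  # overall mean, computed after the per-id fill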