1

I have a dataset (main dataset) which looks like this:

id cleaning_fee boro           zipcode           price
1  NA           Manhattan       10014            100
2  70           Manhattan       10013            125
3  NA           Brooklyn        11201            97
4  25           Manhattan       10012            110
5  30           Staten Island   10305            60

Grouping by Borough and Zipcode I get this (using na.rm = True):

borough   zipcode avgCleaningFee    
Brooklyn    11217   88.32000        
Brooklyn    11231   89.05085        
Brooklyn    11234   42.50000        
Manhattan   10003   97.03738        
Manhattan   10011   109.97647

What I want to do is impute the NAs in the 'cleaning_fee' variable in my main dataset by either:

(a) imputing the grouped mean (as shown above in table 2 where I group on 2 conditions)

or

(b) use KNN regression on variables such as zipcode, boro and the price to impute the cleaning fee variable. (PS I understand how KNN regression works but I haven't used it, would be great if you can explain the code in 1 line or so)

Would be great if anyone can help me out with this. Thanks!!

  • Do you want to *predict* "cleaning fee" for the rows with NA? Or are you really doing imputation in order to use these variables in another model? – Spacedman Nov 26 '18 at 19:06
  • Yes to the first question. I won't be using any further (ML) algorithms on the data once it's imputed. I just want to see how much the price is affected by adding this variable. Mostly the data (post imputation) will be used for visualizations and developing a break-even analysis using calculated fields – Anurag Kaushik Nov 26 '18 at 19:08

1 Answers1

2

We can use the first method

library(dplyr)
df1 %>%
   group_by(Borough, Zipcode) %>%
   mutate(cleaning_fee = replace(Cleaning_fee, 
            is.na(Cleaning_fee), mean(Cleaning_fee, na.rm = TRUE))

Or with na.aggregate from zoo

library(zoo)
df1 %>%
  group_by(Borough, Zipcode) %>%
  mutate(cleaning_fee = na.aggregate(cleaning_fee))
akrun
  • 874,273
  • 37
  • 540
  • 662
  • Hey this worked :D !! I was hoping if you could show me how to do the KNN regression that would be great. And it will help me show off in class that I didn't just use simple imputation :P – Anurag Kaushik Nov 26 '18 at 19:13
  • @AnuragKaushik You may check the `knn.reg` [here](https://stackoverflow.com/questions/47953889/knn-regression-in-r) – akrun Nov 26 '18 at 19:21
  • thanks a lot for providing me some direction! I'll look into it. – Anurag Kaushik Nov 26 '18 at 19:24
  • 1
    Will that method work if there are no other rows with the same zipcode as a missing value? Also, are zipcodes unique such that grouping by borough is pointless? – Spacedman Nov 26 '18 at 19:26
  • Q1: No idea, I'll have to test this. I know for a fact that each zip code has at least 1 value of cleaning fee that is a number and not NA, so maybe it works? Q2: You're right, zipcodes are unique so I don't really have to group by borough! Thanks, it's a force of habit to go from top (borough) to bottom (zip code) but I guess it's redundant and inefficient – Anurag Kaushik Nov 26 '18 at 19:29
  • Or simply `mutate(cleaning_fee = coalesce(Cleaning_fee, mean(Cleaning_fee, na.rm = TRUE))` – Axeman Nov 26 '18 at 21:28