0

I have a large data set where each zipcodes have their corresponding latitude and longitude. In the data set some zipcodes are missing. I need to fill in the missing zipcodes on the basis of their corresponding lat long where that data is not missing. In this example I would want rows 4 and 5 to have a and b inserted for zipcode as they have the same lat and long:

zipcode <- c("a","b","c","","")
lat <- c("1","2","3","1","2")
lon <- c("6","7","8","6","7")
data.frame(zipcode,lat,lon)
  zipcode lat lon
1       a   1   6
2       b   2   7
3       c   3   8
4           1   6
5           2   7

I'd prefer to not install another package unless really necessary.

Thank you

  • Are you certain that lat/lon from the missing-zipcode rows are always going to be found in the other rows? Or is it possible that they may be slightly different (some distance away but still within the same zipcode)? If the latter, one *might* do a distance calc between unknown lat/lon and one lat/lon per known zipcode, but this is imperfect ... it would likely be better with some online lookup API. – r2evans Jan 27 '20 at 20:27
  • Is this something where you only ever expect 1 set of coordinates per zipcode, and you just need to fill it in, as would be the case if these were centroids of zipcode areas? If not, how are you trying to assign zipcodes? Like would you ever have (1, 7)? – camille Jan 27 '20 at 20:53

1 Answers1

1

Using na_if from dplyr to replace blank values with NA in zipcode column and then use fill from tidyr:

library(dplyr)
library(tidyr)
df %>%
    group_by(lat, lon) %>% 
    mutate(zipcode = na_if(zipcode, "")) %>% 
    fill(zipcode)

#   zipcode lat   lon  
   #<fct>   <fct> <fct>
   #1 a       1     6    
   #2 b       2     7    
   #3 c       3     8    
   #4 a       1     6    
   #5 b       2     7 
sm925
  • 2,648
  • 1
  • 16
  • 28
  • My data set has around 50,000 rows and 110 columns. Is this going to affect other columns as well? – wickedpanda Jan 27 '20 at 20:33
  • @wickedpanda No, those other columns, if NA will stay NA. `mutate_all(na_if,"")` does replace all `""` with NA, but only `zipcode` is filled in. – userABC123 Jan 27 '20 at 20:41
  • @wickedpanda I have made an update to fill NA only for `zipcode` column. – sm925 Jan 27 '20 at 20:44
  • I tried this code. I'm getting this warning "`mutate_all()` ignored the following grouping variables: Columns `latitude`, `longitude` Use `mutate_at(df, vars(-group_cols()), myoperation)` to silence the message." – wickedpanda Jan 27 '20 at 20:45
  • *"I'd prefer to not install another package unless really necessary"* to me means installing a package only if required. `library(tidyverse)` is the diametric opposite of that ... I don't know of any package or meta-package that, when installed, will result in more packages being installed (though I'm sure there are larger ones out there). (On my system, this is over 50 new packages.) Perhaps you could reduce your `library` calls to *just* the relevant packages? – r2evans Jan 27 '20 at 20:58
  • thanks. But, how much time is this supposed to take? It has been running for the last 10 mins. – wickedpanda Jan 27 '20 at 21:01
  • @sm925: The code was taking so long, so, I filtered the data and brought it to around 10000 rows. The code ran successfully but the job was not done. Do you need any more information regarding the data set? – wickedpanda Jan 27 '20 at 21:36
  • It would be helpful if you could add reproducible example of subset of your data to show where code is not working. Then I can help. – sm925 Jan 27 '20 at 21:40