I have multiple addresses I want to group together and create a tally for. However they have variation in the formats. I've geocoded the addresses and plan to group them using the geocodes however when grouping them I want to create a new variable which retains at least one version of the address (or multiple variables with each address in the group in a wide format, but I'd settle for just one variable for each group with one address retained).
Here's some example data.
address=c("big fake plaza, 12 this street,district, city",
"Green mansion, district, city",
"Block 7 of orange building district, city",
"98 main street block a blue plaza, city",
"blue red mansion, 46 pearl street, city",
"12 this street, big fake plaza, district, city",
"Green mansion, district, city",
"orange building Block 7 district, city",
"block a 98 main street blue plaza, city",
"blue red mansion, 46 pearl street, city"
"big fake plaza, district, city",
"Green mansion,city")
long =c("112.8838", "111.9154", "114.9318", "116.9318", "112.9320","111.9324",
"112.8838", "111.9154", "114.9318", "116.9318", "112.9320","111.9324",
"112.8838", "111.9154")
lat = c("21.22177", "12.22177", "26.27743", "23.17651", "23.24769", "23.24771",
"21.22177", "12.22177", "26.27743", "23.17651", "23.24769", "23.24771",
"21.22177", "12.22177")
df<-cbind(address, lat, long)
Where I've got to is grouping and counting but dont know how to mutate and create a naming variable based off just one address in the group.
df_agg<- df %>%
group_by(long,lat) %>%
summarise(count = n()) %>%
mutate(bldg = ifelse(address[address==1],address, NA )) ???????
what i'd like for it to look like would be something like this
long lat count bldg
<dbl> <dbl> <int> <chr>
1 112. 21.2 3 "big fake plaza, 12 this street,district, city"
2 114. 12.2 3 "Green mansion, district, city"
3 116. 26.3 2 "98 main street block a blue plaza, city"
4 112. 23.5 2 "Block 7 of orange building district, city"
5 111. 23.5 2 "blue red mansion, 46 pearl street, city"
Obviously we cant group on the address names bcause there is varitaion between the strings. Happy to hear any other suggestions if there are better options. If we could create new variables bldg1 blgd2 ect. for each varying building name in each group that be great to but not a priority.
Thanks in advance.