0

I am trying to count he same addresses and group them by rows. I am reasonably close but have subtle differences across the columns between for the specific address. The aim is to remove any addresses from the row that are not matching and add them as a new row to the df. The differences are oftern between the street number or block number. I've extracted those numbers from the code ad am trying to find those that dont match, remove them and make a new row and change the count appropriate. The count change can be do after, just checking non-missing across the rows.

The dataset actually has 5000 rows with up to 50 buildings in one row. here is a sample.

 df<-data.frame(bldg1 = c("26 this street, big district","block8, fancy estate, small district", "11 normal lane, district"),
                bldg2 = c("27 this street, big district","block8, fancy estate, small district", "11 normal lane, district"),
                bldg3 = c("26 this street, big district","block6, fancy estate, small district", "11 normal lane, district"),
                bldg4 = c("26 this street, big district","block8, fancy estate, small district", NA),
                bldg5 = c("26 this street, big district","block6, fancy estate, small district", "11 normal lane, district"),
                bldg1strnum = c("26",NA, "11"),
                bldg2strnum = c("27",NA, "11"),
                bldg3strnum = c("26",NA, "11"),
                bldg4strnum = c("26",NA, "11"),
                bldg5strnum = c("26",NA, "11"),
                bldg1blck = c(NA,"8", NA),
                bldg2blck = c(NA,"8", NA),
                bldg3blck = c(NA,"6", NA),
                bldg4blck = c(NA,"8", NA),
                bldg5blck = c(NA,"6", NA),
               count = (5,5,4))

I was thinking of using dplyr and across along with length(unique) but dont know how to run it correctly especially how to mutate it into a long format for a new line.

What I like to end up with would be like this below. (street numbers and names arent necessary after mutation

 df<-data.frame(bldg1 = c("26 this street, big district","block8, fancy estate, small district", "11 normal lane, district", "27 this street, big district","block6, fancy estate, small district"),
               bldg2 = c(NA, "block8, fancy estate, small district", "11 normal lane, district",NA,"block6, fancy estate, small district"),
               bldg3 = c("26 this street, big district",NA, "11 normal lane, district", NA, NA),
               bldg4 = c("26 this street, big district","block8, fancy estate, small district", NA,NA,NA),
               bldg5 = c("26 this street, big district",NA, "11 normal lane, district",NA,NA),
               count = ("4","3","4","1","2"))
Konrad Rudolph
  • 530,221
  • 131
  • 937
  • 1,214
Benj Young
  • 59
  • 5

1 Answers1

0

Is this what you are looking for:

df %>% 
  select(bldg1, bldg2, bldg3, bldg4, bldg5) %>% 
  pivot_longer(
    cols = everything()
  ) %>% 
  arrange(value) %>% 
  add_count(value)

Output:

   name  value                                    n
   <chr> <chr>                                <int>
 1 bldg1 11 normal lane, district                 4
 2 bldg2 11 normal lane, district                 4
 3 bldg3 11 normal lane, district                 4
 4 bldg5 11 normal lane, district                 4
 5 bldg1 26 this street, big district             4
 6 bldg3 26 this street, big district             4
 7 bldg4 26 this street, big district             4
 8 bldg5 26 this street, big district             4
 9 bldg2 27 this street, big district             1
10 bldg3 block6, fancy estate, small district     2
11 bldg5 block6, fancy estate, small district     2
12 bldg1 block8, fancy estate, small district     3
13 bldg2 block8, fancy estate, small district     3
14 bldg4 block8, fancy estate, small district     3
15 bldg4 NA                                       1
TarJae
  • 72,363
  • 6
  • 19
  • 66
  • Not, quite I need to keep those addresses which are the same within the same row, and the addresses have some subtle differences in the spelling so you cant group on them, thats why I extract the street/block number within the row and use that to see if the address is different – Benj Young Apr 26 '21 at 04:33