Count how many times strings from one data frame appear in another data frame in R dplyr

Question

I have two data frames that look like this:

df1 <- data.frame(reference=c("cat","dog"))
print(df1)
#>   reference
#> 1       cat
#> 2       dog
df2 <- data.frame(data=c("cat","car","catt","cart","dog","dog","pitbull"))
print(df2)
#>      data
#> 1     cat
#> 2     car
#> 3    catt
#> 4    cart
#> 5     dog
#> 6     dog
#> 7 pitbull

Created on 2021-12-29 by the reprex package (v2.0.1)

I want to count how many times the words cat and dog from df1 occur in df2. I want my result to look like this:

animals   n
cat       1
dog       2

Any help or guidance is appreciated. My reference list is huge. I tried to grep each one of them, but that will take me a long time.

Thank you for your time. Happy holidays.

Re: "I tried to grep each one" - you need grep and regex when you are doing pattern matching or partial string matching. When you are matching whole exact strings as you are here, you just need `==` or `%in%` or other non-regex functions (as all the answers here illustrate). — Gregor Thomas, Dec 29 '21 at 20:27
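
To illustrate the distinction made in this comment, here is a minimal sketch on the question's data. The names `partial` and `exact` are chosen here for illustration only. Regex matching with `grepl` also counts strings that merely contain the reference, while `==` counts only whole-string matches.

```r
df1 <- data.frame(reference = c("cat", "dog"))
df2 <- data.frame(data = c("cat", "car", "catt", "cart", "dog", "dog", "pitbull"))

# Partial (regex) matching: "cat" also matches "catt"
partial <- sapply(df1$reference, function(p) sum(grepl(p, df2$data)))

# Exact matching: only strings equal to the reference are counted
exact <- sapply(df1$reference, function(p) sum(df2$data == p))

partial  # cat = 2, dog = 2
exact    # cat = 1, dog = 2
```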

TarJae · Answer 1 · score 6 · 2021-12-29T20:29:01.390

Update: Thanks to Gregor Thomas:

library(dplyr)

left_join(df1,df2, by=c("reference"="data")) %>% 
  count(reference)

output:

  reference n
1       cat 1
2       dog 2

We could use semi_join and then count:

library(dplyr)

semi_join(df2,df1, by=c("data"="reference")) %>% 
  count(data)

  data n
1  cat 1
2  dog 2

edited Dec 29 '21 at 20:29

answered Dec 29 '21 at 20:21

TarJae


I would stick with a `left_join(df1, df2)` unless OP clearly specifies that they want to omit `reference` rows with 0 counts. – Gregor Thomas Dec 29 '21 at 20:25
Thanks Gregor Thomas. Will update. – TarJae Dec 29 '21 at 20:26
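
One caveat on zero counts: with `left_join(df1, df2) %>% count(reference)`, a reference string that has no match in `df2` still contributes one row to the join, so `count()` reports 1 for it, not 0. A possible sketch that returns true zeros is to count `df2` first and then join the counts onto `df1`. The extra reference `"bird"` and the name `counts` are hypothetical, added here only to show the zero case.

```r
library(dplyr)

# "bird" is a hypothetical extra reference with no match in df2
df1 <- data.frame(reference = c("cat", "dog", "bird"))
df2 <- data.frame(data = c("cat", "car", "catt", "cart", "dog", "dog", "pitbull"))

# Count df2 first, then attach the counts to every reference string
counts <- df1 %>%
  left_join(count(df2, data), by = c("reference" = "data")) %>%
  mutate(n = coalesce(n, 0L))  # unmatched references get 0 instead of NA

counts
#>   reference n
#> 1       cat 1
#> 2       dog 2
#> 3      bird 0
```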

score 4 · Answer 2 · answered Dec 29 '21 at 20:16

It may be faster with a join

library(data.table)
setDT(df2)[, .(animals = data)][df1, .(n = .N), 
     on = .(animals = reference), by = .EACHI]
   animals n
1:     cat 1
2:     dog 2

Or use `table` after subsetting the data in base R

table(subset(df2, data %in% df1$reference, select = data))
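
A related base R variant, sketched here as an alternative and not taken from the answer above, is to turn `df2$data` into a factor whose levels are the reference strings. Values outside the levels become `NA` and are dropped by `table()`, and references with no match show up with a count of 0. The names `tab` and `res` are illustrative.

```r
df1 <- data.frame(reference = c("cat", "dog"))
df2 <- data.frame(data = c("cat", "car", "catt", "cart", "dog", "dog", "pitbull"))

# Restrict the levels to the reference strings; everything else becomes NA
tab <- table(animals = factor(df2$data, levels = df1$reference))

# One row per reference string, with the count in column n
res <- as.data.frame(tab, responseName = "n")
res
#>   animals n
#> 1     cat 1
#> 2     dog 2
```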

PaulS · Accepted Answer · score 4 · 2021-12-29T20:21:19.287

A possible solution, tidyverse-based:

library(tidyverse)

df1 <- data.frame(reference=c("cat","dog"))
df2 <- data.frame(data=c("cat","car","catt","cart","dog","dog","pitbull"))

df1 %>% 
  group_by(animal = reference) %>% 
  summarise(n = sum(reference == df2$data), .groups = "drop")

#> # A tibble: 2 × 2
#>   animal     n
#>   <chr>  <int>
#> 1 cat        1
#> 2 dog        2

edited Dec 29 '21 at 20:21

answered Dec 29 '21 at 20:18

PaulS


score 2 · Answer 4 · answered Dec 29 '21 at 20:20

Here is a third option:

library(tidyverse)

df1 <- tibble(reference=c("cat","dog"))
df2 <- tibble(data=c("cat","car","catt","cart","dog","dog","pitbull"))

df2 |>
  count(data) |>
  filter(data %in% df1$reference) |>
  rename(animal = data)
#> # A tibble: 2 x 2
#>   animal     n
#>   <chr>  <int>
#> 1 cat        1
#> 2 dog        2

jpdugo17 · Answer 5 · score 2 · 2021-12-29T23:05:58.207

We can use `str_count` with the column in the second df collapsed into one string.

library(tidyverse)

df1 %>%
  transmute(animals = reference, n = str_c(df2$data, collapse = " ") %>%
    str_count(str_c("\\b", reference, "\\b")) )
#>   animals n
#> 1     cat 1
#> 2     dog 2

Created on 2021-12-29 by the reprex package (v2.0.1)

edited Dec 29 '21 at 23:05

answered Dec 29 '21 at 22:54

jpdugo17


score 1 · Answer 6 · answered Dec 30 '21 at 01:37


df1$n <- colSums(outer(df2$data, df1$reference, '=='))

df1
#>   reference n
#> 1       cat 1
#> 2       dog 2
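
Since this answer comes without explanation, here is a short sketch of what the one-liner does, with the intermediate matrix stored under the illustrative name `m`. `outer()` compares every element of `df2$data` with every element of `df1$reference`, and `colSums()` counts the `TRUE` entries per reference string. Note that the matrix has `nrow(df2) * nrow(df1)` entries, so it may use a lot of memory when the reference list is huge.

```r
df1 <- data.frame(reference = c("cat", "dog"))
df2 <- data.frame(data = c("cat", "car", "catt", "cart", "dog", "dog", "pitbull"))

# Logical matrix: rows correspond to df2$data, columns to df1$reference
m <- outer(df2$data, df1$reference, "==")
dim(m)
#> [1] 7 2

# Each column sum is the number of exact matches for one reference string
colSums(m)
#> [1] 1 2
```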

answered Dec 30 '21 at 01:37

IceCreamToucan


Wow, a very nice one. Thank you, it's impressive. – LDT Dec 30 '21 at 11:20
