find partially similar string elements and summarize the data

Question

I have a dataset and I want to summarize my data based on (let's say) the first three characters. in fact, concatenate rows which have the same 3 first letter in the column. For example:

df
title freq
ACM100    3
ACM200    2
ACM300    2
MAT11     1
MAT21     2
CMP00     3
CMP10     3

I want to summarize the database on the title of first 3 characters and count the frequency.

result:
title  freq
ACM    7
MAT    3
CMP    6

Would be appreciated to help me in R.

Your first step is to create the new field, such as `substr(df$title, 1, 3)`, and then do normal aggregation/summarization on that. There are plenty of Q/As on SO (including `group_by(letters) %>% summarise(...)` that @CalumYou just posted) on group aggregation, and many types of solutions ranging from base-R to `dplyr` and `data.table`, depending on your current toolkit. — r2evans, Oct 23 '18 at 19:48

score 3 · Answer 1 · answered Oct 23 '18 at 19:46

We can use separate to split out the letters to use as groups, and then group_by and summarise to get the desired result.

library(tidyverse)
df <- read_table2(
"title freq
ACM100    3
ACM200    2
ACM300    2
MAT11     1
MAT21     2
CMP00     3
CMP10     3"
)
df %>%
  separate(title, c("letters", "numbers"), sep = 3) %>%
  group_by(letters) %>%
  summarise(freq = sum(freq))
#> # A tibble: 3 x 2
#>   letters  freq
#>   <chr>   <int>
#> 1 ACM         7
#> 2 CMP         6
#> 3 MAT         3

Created on 2018-10-23 by the reprex package (v0.2.0).

score 2 · Accepted Answer · answered Oct 23 '18 at 19:50

2

You can use aggregate with transform

aggregate(freq ~ title, transform(df, title = substr(title, 1, 3)), sum)
#   title freq
# 1   ACM    7
# 2   CMP    6
# 3   MAT    3

answered Oct 23 '18 at 19:50

Rich Scriven

97,041
11
181
245

score 1 · Answer 3 · answered Oct 23 '18 at 20:12

Because you tagged this question with regex and have not yet received a data.table answer, here is one more option for you

library(data.table)
setDT(df)
df[, .(freq = sum(freq)), by = .(title = sub("[0-9]+", "", title))]
#   title freq
#1:   ACM    7
#2:   MAT    3
#3:   CMP    6

score 0 · Answer 4 · answered Oct 23 '18 at 19:48

This works.

df$firstletters <- substr(df$title,1,3)
df.grouped <- df %>% group_by(firstletters)
df.summarized <- df.grouped %>% summarize(count = sum(freq))
> df.summarized
# A tibble: 3 x 2
  firstletters count
  <chr>        <int>
1 ACM              7
2 CMP              6
3 MAT              3

find partially similar string elements and summarize the data

4 Answers4