R - Handling character variables in data.table by matching values on the same prefix while truncating the diverging suffix

Question

I currently have a seemingly easy task, but as the data set is quite large, it is not feasible to do the task manually by hand.

The data.table stores one column of name and numerous other columns storing data corresponding to that name object, say var1, var2 and var3. The name variable is type character and the others are type numeric.

Now to the question. Names can start the same (same prefix) but might end differntly (diverging suffix). Imagine for example plant species of different classes or company names with differnt legal entitity form.

My generic question is, whether there is a simple method to match all values of the name variable that have the same prefix and assign them this "same" prefix while discarding the "diverging" suffix. See below for "abstract" example data.

EDIT Names are not separated by an underscore but by a space and can have several words (not solely constrained to two words)

EDIT 2 The longest shared prefix should be retained. Added some examples to illustrate that point. Proposed solutions would result in the last to entries being "American" instead "American Tulip".

Name                  Var1   Var2   Var3
Rose red              22     12     3.5
Rose Yellow           12     13     4.8
Rose orange           18     14     5.0
Rose dark orange      17     12     4.9
Tulip red             44     21     9.2
Tulip yellow          33     25     7.2
Tulip Orange          38     12     9.4
American Tulip red    22     12     3.3
American Tulip orange 31     16     4.5

I want to get to a "tidy" table like below to allow aggregation of data and computing statistics for each "most hierachical class".

Name                  Var1   Var2   Var3
Rose                  22     12     3.5
Rose                  12     13     4.8
Rose                  18     14     5.0
Rose                  17     12     4.9
Tulip                 44     21     9.2
Tulip                 33     25     7.2
Tulip                 38     12     9.4
American Tulip        22     12     3.3
American Tulip        31     16     4.5

I would highly appreciate any input, as I need to apply this logic to a data set with ~ 250k entries. Thanks in advance for any advice!

I'll give it a try in a minute. ```gsub``` might be a nice way to extract the names for matching! — aimbotter21, Apr 03 '21 at 15:50
Unfortunately this does not change the number of unique names in the first column. As such, I reckon that it did not change the names. Sry, my bad. I have names in my data base that arent actually seperated by an underscore ```_```, so I'll try and adjust to my example — aimbotter21, Apr 03 '21 at 15:57
I am not sure, but its working at my end. Its throwing what you are showing in the end table — PKumar, Apr 03 '21 at 15:59
Yeah, you are right! I am trying to edit your proposal as my names look like following on my end. "Rose right" / "Rose Yellow" / "Rose orange" "Rose dark orange" etc — aimbotter21, Apr 03 '21 at 16:01

Anoushiravan R · Accepted Answer · 2021-04-03T16:01:19.350

1

I think this will also help you if you are also interested in alternative solutions:

library(dplyr)
library(stringr)

df <- tribble(
  ~Name,           ~Var1,   ~Var2,   ~Var3,
  "Rose_red",       22,     12,     3.5,
  "Rose_yellow",    12,     13,     4.8,
  "Rose_orange",    18,     14,     5.0,
  "Tulip_red",      44,     21,     9.2,
  "Tulip_yellow",   33,     25,     7.2,
  "Tulip_orange",   38,     12,     9.4
)

df %>%
  mutate(Name = str_extract(Name, "[A-Za-z]+"))

# A tibble: 6 x 4
  Name   Var1  Var2  Var3
  <chr> <dbl> <dbl> <dbl>
1 Rose     22    12   3.5
2 Rose     12    13   4.8
3 Rose     18    14   5  
4 Tulip    44    21   9.2
5 Tulip    33    25   7.2
6 Tulip    38    12   9.4

edited Apr 03 '21 at 16:01

answered Apr 03 '21 at 15:53

Anoushiravan R

21,622
3
18
41

Thanks for your proposal! Works like a charm on my end with the toy data. Realized though that my names were actually not separated by an underscore but by a space. How would I have to change this to fit my data structure? Thx F.ex. "Rose right" / "Rose Yellow" / "Rose orange" "Rose dark orange" – aimbotter21 Apr 03 '21 at 16:04
My pleasure, I replaced the underscore in your sample data with white space and it worked. Have you tried it on the original data set yet? – Anoushiravan R Apr 03 '21 at 16:08
This also works `df %>% mutate(Name = str_extract(Name, "[A-Za-z]+\\s"))`. – Anoushiravan R Apr 03 '21 at 16:10
Again, much appreciated! It does work! Maybe just a little too well for my case :) But imagine, I have now two names where the first two words are the same, such as (only example I can think of right now) American Tulip red and American Tulip orange. They are both American now, but I would love to somehow retain those two prefixes then! (The longest shared prefix) – aimbotter21 Apr 03 '21 at 16:18

Chris Ruehlemann · Answer 2 · 2021-04-03T16:13:43.067

1

We can use backreference:

df$Name <- sub("(\\S+).*", "\\1", df$Name)

How this works:

(\\S+) is a capturing group matching any sequence of characters that are not whitespace
.* matches whatever comes after the capturing group
\\1 is the backreference; it 'recollects' the contents of the capturing group

Result:

df
# A tibble: 6 x 4
  Name   Var1  Var2  Var3
  <chr> <dbl> <dbl> <dbl>
1 Rose     22    12   3.5
2 Rose     12    13   4.8
3 Rose     18    14   5  
4 Tulip    44    21   9.2
5 Tulip    33    25   7.2
6 Tulip    38    12   9.4

edited Apr 03 '21 at 16:13

answered Apr 03 '21 at 16:08

Chris Ruehlemann

20,321
4
12
34

Awesome! Thanks for that proposal! How would I have to change it if the names are not separated by an underscore if at all. See edit in my original question. Thanks – aimbotter21 Apr 03 '21 at 16:10
And thanks for the clarification. Very useful for future reference! – aimbotter21 Apr 03 '21 at 16:12
The code should work for "Rose dark green" just as for "Rose dark". – Chris Ruehlemann Apr 03 '21 at 16:13

R - Handling character variables in data.table by matching values on the same prefix while truncating the diverging suffix

2 Answers2