I currently have a seemingly easy task, but as the data set is quite large, it is not feasible to do the task manually by hand.
The data.table
stores one column of name
and numerous other columns storing data corresponding to that name object, say var1
, var2
and var3
. The name
variable is type character and the others are type numeric.
Now to the question. Names can start the same (same prefix) but might end differntly (diverging suffix). Imagine for example plant species of different classes or company names with differnt legal entitity form.
My generic question is, whether there is a simple method to match all values of the name
variable that have the same prefix and assign them this "same" prefix while discarding the "diverging" suffix. See below for "abstract" example data.
EDIT Names are not separated by an underscore but by a space and can have several words (not solely constrained to two words)
EDIT 2 The longest shared prefix should be retained. Added some examples to illustrate that point. Proposed solutions would result in the last to entries being "American" instead "American Tulip".
Name Var1 Var2 Var3
Rose red 22 12 3.5
Rose Yellow 12 13 4.8
Rose orange 18 14 5.0
Rose dark orange 17 12 4.9
Tulip red 44 21 9.2
Tulip yellow 33 25 7.2
Tulip Orange 38 12 9.4
American Tulip red 22 12 3.3
American Tulip orange 31 16 4.5
I want to get to a "tidy" table like below to allow aggregation of data and computing statistics for each "most hierachical class".
Name Var1 Var2 Var3
Rose 22 12 3.5
Rose 12 13 4.8
Rose 18 14 5.0
Rose 17 12 4.9
Tulip 44 21 9.2
Tulip 33 25 7.2
Tulip 38 12 9.4
American Tulip 22 12 3.3
American Tulip 31 16 4.5
I would highly appreciate any input, as I need to apply this logic to a data set with ~ 250k entries. Thanks in advance for any advice!