
I have a vector of character strings, where each string is a comma-separated list of species names (i.e. Genus species). Each string can contain a variable number of species (in the example below, the number of species per string ranges from 1 to 3).

trees <-  c("Erythrina poeppigiana", "Erythrina poeppigiana, Juglans regia x Juglans nigra", "Erythrina poeppigiana, Juglans regia x Juglans nigra, Chloroleucon eurycyclum") 

I wish to obtain a vector of character strings of the same length, but where each string is a comma-separated list of only the genus portions of the names:

genera <- c("Erythrina", "Erythrina, Juglans", "Erythrina, Juglans, Chloroleucon")

The tricky case is the "Juglans regia x Juglans nigra" hybrid species. This should just come out as "Juglans", as it is all contained between two commas and is therefore just one species. In hybrids like this, the genus is always the same on both sides of the "x", so taking just the first word in that portion of the string is fine, as in the more standard cases. However, solutions that attempt to pull out "every other word" won't work because of these hybrids.

My attempt was to first strsplit by ", " to separate out the individual species names, then strsplit again by " " to separate out the genus names:

    split.list <- sapply(strsplit(trees, split=", "), strsplit, 1, split=" ")
    split.list
    [[1]]
    [[1]][[1]]
    [1] "Erythrina"   "poeppigiana"


    [[2]]
    [[2]][[1]]
    [1] "Erythrina"   "poeppigiana"

    [[2]][[2]]
    [1] "Juglans" "regia"   "x"       "Juglans" "nigra"


    [[3]]
    [[3]][[1]]
    [1] "Erythrina"   "poeppigiana"

    [[3]][[2]]
    [1] "Juglans" "regia"   "x"       "Juglans" "nigra"

    [[3]][[3]]
    [1] "Chloroleucon" "eurycyclum"

But then the indexing to pull out the genus names and recombine is quite complicated (and I can't even figure it out!). Is there a cleaner solution for an ordered split and recombination?

It would also be acceptable to leverage the fact that genus names are the only capitalized words in every string. Maybe a regex that pulls out just the words that start with a capital letter?

Kevin W
  • The way it is shown in the original question is correct. Each string in the vector can contain a variable number of species. I have edited the question to make this clearer. – Kevin W Feb 24 '17 at 14:11
  • In the future, if possible, it would be better to structure this as a list. Like this: `trees <- list(c("Erythrina poeppigiana"), c("Erythrina poeppigiana", "Terminalia amazonia"), c("Erythrina poeppigiana", "Terminalia amazonia", "Chloroleucon eurycyclum"))`. – lmo Feb 24 '17 at 14:21

1 Answer


Here is an idea via Base R,

sapply(strsplit(trees, ' '), function(i) toString(i[c(TRUE, FALSE)]))
#[1] "Erythrina"    "Erythrina, Terminalia"         "Erythrina, Terminalia, Chloroleucon"
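
As the question points out, picking every other word depends on each species name having exactly two words. A small check on one of the hybrid strings from the question (the variable names `hybrid` and `words` are only for illustration) shows where the positions drift:

```r
# One string from the question that contains a hybrid species.
hybrid <- "Erythrina poeppigiana, Juglans regia x Juglans nigra"

# Split on single spaces, as in the one-liner above.
words <- strsplit(hybrid, " ")[[1]]

# Keep words 1, 3, 5, 7: the hybrid has five words, so "x" and "nigra"
# are picked up in addition to the genus names.
words[c(TRUE, FALSE)]
#[1] "Erythrina" "Juglans"   "x"         "nigra"
```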

EDIT

Further to your comment, for the new trees, you can simply do,

sapply(strsplit(trees, ', '), function(i) toString(sub('\\s+.*', '', i)))
#[1] "Erythrina, Juglans"               "Erythrina"                       
#[3] "Erythrina, Juglans, Chloroleucon"
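
For readers who want the steps spelled out, here is a minimal sketch of the same split, trim, and recombine idea. The helper names `genus_only` and `genus_by_capital` are made up for illustration. Both assume that species are always separated by commas; the first assumes the genus is the first word of each species name, and the second assumes the genus is the first capitalized word, as suggested in the question.

```r
# Hypothetical helpers (names are illustrative, not part of the answer above).

# Variant 1: keep the first word of every comma-separated species.
genus_only <- function(x) {
  species <- strsplit(x, ",\\s*")          # one character vector per input string
  vapply(species,
         function(sp) toString(sub("\\s+.*", "", trimws(sp))),
         character(1))
}

# Variant 2: keep the first capitalized word of every species.
genus_by_capital <- function(x) {
  species <- strsplit(x, ",\\s*")
  vapply(species,
         function(sp) toString(regmatches(sp, regexpr("[A-Z][a-z]+", sp))),
         character(1))
}

trees <- c("Erythrina poeppigiana",
           "Erythrina poeppigiana, Juglans regia x Juglans nigra",
           "Erythrina poeppigiana, Juglans regia x Juglans nigra, Chloroleucon eurycyclum")

genus_only(trees)
genus_by_capital(trees)
```

Because the split on commas happens first, the order of the elements in `trees` does not matter, and a hybrid such as "Juglans regia x Juglans nigra" contributes only its first word.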
Sotos
  • This works perfectly for the example I gave. I just tried it on my full dataset, though, and it made me realize that I have some species that are hybrids, which could be in the list: trees <- c("Juglans regia x Juglans nigra", "Juglans regia x Juglans nigra, Terminalia amazonia", "Juglans regia x Juglans nigra, Terminalia amazonia, Chloroleucon eurycyclum"). "Juglans regia x Juglans nigra" in this case just needs to come out as "Juglans". – Kevin W Feb 24 '17 at 14:20
  • I edited the original question to include this. Sorry, I didn't even realize that there were hybrids in the list! – Kevin W Feb 24 '17 at 14:26
  • Your solution seems to only work on the example because of how the example happens to add a species in each element. I just rearranged the elements of "trees", and your solution gives the same answer. Try: trees <- c("Erythrina poeppigiana, Juglans regia x Juglans nigra", "Erythrina poeppigiana", "Erythrina poeppigiana, Juglans regia x Juglans nigra, Chloroleucon eurycyclum") – Kevin W Feb 24 '17 at 15:37
  • ok, will revise again in a few minutes. In the future make sure you include all "corner" cases in your examples – Sotos Feb 24 '17 at 16:07