0

I have a data frame sp that contains several species names, but because they come from different databases, they are written in different ways.

For example, the same species can appear both as "Urtica dioica" and as "Urtica dioica L.".

To correct this, I use the following code, which extracts only the first two words from a row:

paste(strsplit(sp[i,"sp"]," ")[[1]][1],strsplit(sp[i,"sp"]," ")[[1]][2],sep=" ")

For now, this code is integrated in a for loop, which works but takes ages to finish:

for (i in seq_along(sp$sp)) {
    sp[i,"sp2"] = paste(strsplit(sp[i,"sp"]," ")[[1]][1],
                        strsplit(sp[i,"sp"]," ")[[1]][2],
                        sep=" ")
}

Is there a way to improve this basic code using vectors or an apply function?

joran
user3443183

3 Answers

1

You could just use vectorized regular expression functions:

library(stringr)
x <- c("Urtica dioica", "Urtica dioica L.")
str_extract(string = x, pattern = "\\w+ \\w+")
# [1] "Urtica dioica" "Urtica dioica"

I happen to have found stringr convenient here, but with the right regex for your specific data you could do this just as well with base functions like gsub.
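For the base R route, here is a minimal sketch using sub with a capture group. It assumes the words are separated by whitespace and that the strings have no leading spaces; the toy data frame and the sp2 column name simply mirror the question and are only for illustration.

```r
x <- c("Urtica dioica", "Urtica dioica L.")

# Capture the first two whitespace-separated words and drop the rest.
# Strings with fewer than two words do not match and are returned unchanged.
first_two <- sub("^(\\S+\\s+\\S+).*$", "\\1", x)
first_two
# [1] "Urtica dioica" "Urtica dioica"

# The same call works on a whole column at once, so no loop is needed.
# (Toy data frame standing in for the asker's `sp`.)
sp <- data.frame(sp = x, stringsAsFactors = FALSE)
sp$sp2 <- sub("^(\\S+\\s+\\S+).*$", "\\1", sp$sp)
```

Using \\S+ rather than \\w+ keeps hyphens and other non-word characters inside a name, which may or may not be what your data needs.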

joran
  • It works perfectly well! Thank you. I do not know which regex to use with `gsub`. I tried `gsub(" .*$", "", x)`, but it keeps only the first word. – user3443183 Jul 24 '14 at 16:26
0

You might want to check to see if there are more than 2 words in the string before doing each extraction:

if((sapply(gregexpr("\\W+", sp[i, "sp"]), length) + 1) > 2){
    ...
}
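
The same check can also be done on the whole vector at once instead of inside a loop. This is only a sketch: it counts words by splitting on whitespace (a different technique from the gregexpr count above), assumes no leading spaces, and the names n_words and has_extra are made up for illustration.

```r
x <- c("Urtica", "Urtica dioica", "Urtica dioica L.")

# Number of whitespace-separated words in each string
n_words <- lengths(strsplit(x, "\\s+"))

# TRUE only for strings that have more than two words
has_extra <- n_words > 2

# Trim just those strings down to their first two words
x[has_extra] <- sub("^(\\S+\\s+\\S+).*$", "\\1", x[has_extra])
```

Strings with one or two words are left untouched, so a lone genus name is not altered.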
0

There's a function for that.

It is also from stringr: the word function.

> choices <- c("Urtica dioica", "Urtica dioica L..") 
> library(stringr)
> word(choices, 1:2)
# [1] "Urtica" "dioica"
> word(choices, rep(1:2, 2))
# [1] "Urtica" "dioica" "Urtica" "dioica"

These return individual words. To get one string per element containing the first two words (genus and species), pass the start and end positions:

> word(choices, 1, 2)
# [1] "Urtica dioica" "Urtica dioica"

The final line gets the first two words from each string in the vector choices.

Rich Scriven