Speed up a loop in R, using character strings simplification

Question

I have a data frame sp that contains species names. Because they come from different databases, the same species is written in different ways.

For example, one species can appear as both Urtica dioica and Urtica dioica L.

To correct this, I use the following code, which extracts only the first two words from a row:

paste(strsplit(sp[i,"sp"]," ")[[1]][1],strsplit(sp[i,"sp"]," ")[[1]][2],sep=" ")

For now, this code is integrated in a for loop, which works but takes ages to finish:

for (i in seq_along(sp$sp)) {
    sp[i,"sp2"] = paste(strsplit(sp[i,"sp"]," ")[[1]][1],
                        strsplit(sp[i,"sp"]," ")[[1]][2],
                        sep=" ")
}

Is there a way to improve this basic code using vectorization or an apply function?

score 1 · Accepted Answer · answered Jul 24 '14 at 15:29


You could just use vectorized regular expression functions:

library(stringr)
x <- c("Urtica dioica", "Urtica dioica L.")
str_extract(string = x, pattern = "\\w+ \\w+")
# [1] "Urtica dioica" "Urtica dioica"

I happen to have found stringr convenient here, but with the right regex for your specific data you could do this just as well with base functions like gsub.
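As a rough illustration of the base-function route mentioned above, here is a minimal sketch using sub with a capture group. The sample rows are made up for the example (the genus-only entry "Urtica" is an assumption), and the pattern assumes the words are separated by whitespace; it may need adjusting for your data.

```r
# Hypothetical sample data standing in for the real data frame `sp`.
sp <- data.frame(sp = c("Urtica dioica", "Urtica dioica L.", "Urtica"),
                 stringsAsFactors = FALSE)

# ^(\\S+\\s+\\S+) captures the first two whitespace-separated words;
# .*$ matches whatever follows, and "\\1" keeps only the captured part.
# Strings with fewer than two words do not match and are returned unchanged.
sp$sp2 <- sub("^(\\S+\\s+\\S+).*$", "\\1", sp$sp)

sp$sp2
# [1] "Urtica dioica" "Urtica dioica" "Urtica"
```

Because sub is vectorized over its input, the whole column is processed in one call and no for loop is needed.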

answered Jul 24 '14 at 15:29

joran


It works perfectly well! Thank you. I do not know which regex to use with `gsub`. I tried `gsub(" .*$", "", x)`, but it keeps only the first word. – user3443183 Jul 24 '14 at 16:26

tom.purucker · Answer 2 · 2014-07-24T15:24:29.010

0

You might want to check whether the string has more than two words before each extraction (here i is the string being tested):

if((sapply(gregexpr("\\W+", i), length) + 1) > 2){
    ...
}
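The same check can also be done for a whole vector at once. The following is only a sketch under some assumptions: the sample vector is invented for illustration, words are taken to be separated by whitespace, and lengths and trimws require R 3.2.0 or later.

```r
# Hypothetical sample: three words, two words, and a family name only.
x <- c("Urtica dioica L.", "Urtica dioica", "Urticaceae")

# Count whitespace-separated words in each element.
n_words <- lengths(strsplit(trimws(x), "\\s+"))

# Truncate only the entries that have more than two words.
has_extra <- n_words > 2
x[has_extra] <- sub("^(\\S+\\s+\\S+).*$", "\\1", x[has_extra])

x
# [1] "Urtica dioica" "Urtica dioica" "Urticaceae"
```

Entries with one or two words are left untouched, so a row that holds only a family name stays as it is.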

edited Jul 24 '14 at 15:24

answered Jul 24 '14 at 15:16

tom.purucker


Yes, it could be useful; sometimes I only have the family and not the species. Thanks. – user3443183 Jul 24 '14 at 16:23

Rich Scriven · Answer 3 · 2014-07-24T16:32:52.597

0

There's a function for that.

It is also from stringr: the word function.

> choices <- c("Urtica dioica", "Urtica dioica L..") 
> library(stringr)
> word(choices, 1:2)
# [1] "Urtica" "dioica"
> word(choices, rep(1:2, 2))
# [1] "Urtica" "dioica" "Urtica" "dioica"

These calls return individual words. To get one string per element containing the first two words, give both a start and an end position:

> word(choices, 1, 2)
# [1] "Urtica dioica" "Urtica dioica"

The last call returns the first two words of each string in the vector choices.

edited Jul 24 '14 at 16:32

answered Jul 24 '14 at 15:59

Rich Scriven


I did not know the `word` function, thanks! After that, I need to use the `paste` function to merge the outputs, and it works. – user3443183 Jul 24 '14 at 16:25
@user3443183 Actually, `word` does that too, see my edit – Rich Scriven Jul 24 '14 at 16:32
Indeed, a very useful function. – user3443183 Jul 24 '14 at 16:44
