efficient programming in R

Question

I have data like this:

author_id paper_id confirmed     author_name1   author_affiliation1         author_name   
   826    25733         1     Emanuele Buratti  Genetic engineering    Emanuele Buratti
   826    25733         1     Emanuele Buratti  International center   Emanuele Buratti
   826    47276         1     Emanuele Buratti                         Emanuele Buratti
   826    77012         1     Emanuele Buratti                         Emanuele Buratti
   826    77012         1     Emanuele Buratti                         Emanuele Buratti
   826    79468         1     Emanuele Buratti                         Emanuele Buratti

author_affiliation
Genetic enginereing
The International Centre for Genetic Engineering and Biotechnology, Padriciano 66, Trieste, Italy


International Centre for Genetic Engineering and Biotechnology, Padriciano 99, 34149 Trieste, Italy

Now, for each row, I have to compute the stringdist between author_name and author_name1 (name_dist) and the stringdist between author_affiliation and author_affiliation1 (aff_dist).

I am using

library(stringdist)

name_dist <- vector()
aff_dist <- vector()
for (i in 1:nrow(mer1)) {
  # one stringdist call per row; both result vectors grow on every iteration
  name_dist[i] <- stringdist(mer1$author_name1[i], mer1$author_name[i], method = "lv")
  aff_dist[i] <- stringdist(mer1$author_affiliation1[i], mer1$author_affiliation[i], method = "lv")
}

But this takes a lot of time. How could this be done more efficiently?

Thanks

A general comment on R: many loops execute slowly because vectors are grown dynamically. You can gain a lot of efficiency by preallocating the space, which is done by passing a `length` argument to `vector()`. – Christopher Louden, Mar 24 '14 at 12:31
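
As a minimal sketch of the preallocation suggested in this comment, assuming the stringdist package is installed. The data frame `mer1_toy` is made up for illustration and only stands in for the asker's `mer1`:

```r
library(stringdist)

# Hypothetical toy data standing in for two columns of mer1
mer1_toy <- data.frame(
  author_name1 = c("Emanuele Buratti", "Emanuele Buratti", "Emanuele Buratti"),
  author_name  = c("Emanuele Buratti", "Emanuele Burati", "Emanuela Buratti"),
  stringsAsFactors = FALSE
)

n <- nrow(mer1_toy)
name_dist <- vector("numeric", length = n)  # allocate all n slots up front

for (i in seq_len(n)) {
  # fill the preallocated slot instead of growing the vector
  name_dist[i] <- stringdist(mer1_toy$author_name1[i],
                             mer1_toy$author_name[i],
                             method = "lv")
}
```

Preallocating avoids repeatedly growing the result vector, but the loop still makes one `stringdist` call per row.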

score 1 · Accepted Answer · answered Mar 24 '14 at 12:30

You can vectorize it directly:

i=1:nrow(mer1)
name_dist<-stringdist(mer1$author_name1[i],mer1$author_name[i],method="lv")
aff_dist<-stringdist(mer1$author_affiliation1[i],mer1$author_affiliation[i],method="lv")

TooTone

This is not completely correct code. You should lose the indices completely. – Feb 26 '16 at 17:05
@MarkvanderLoo good spot. In this case where we are processing the whole dataset it would be better without the `i` and `[i]`s. – TooTone Feb 26 '16 at 19:04
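
Following the two comments above, the index-free version might look like the sketch below. It assumes the stringdist package is installed, and `mer1_toy` is a made-up data frame that only stands in for `mer1`:

```r
library(stringdist)

# Hypothetical toy data standing in for mer1
mer1_toy <- data.frame(
  author_name1        = c("Emanuele Buratti", "Emanuele Buratti", "Emanuele Buratti"),
  author_name         = c("Emanuele Buratti", "Emanuele Burati", "Emanuela Buratti"),
  author_affiliation1 = c("Genetic engineering", "", ""),
  author_affiliation  = c("Genetic enginereing", "", ""),
  stringsAsFactors = FALSE
)

# Whole columns are passed at once; stringdist compares them element by element
name_dist <- stringdist(mer1_toy$author_name1, mer1_toy$author_name, method = "lv")
aff_dist  <- stringdist(mer1_toy$author_affiliation1, mer1_toy$author_affiliation,
                        method = "lv")
```

Each result has one entry per row, in row order, so it can be assigned straight back as a new column.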

score 1 · Answer 2 · answered Mar 24 '14 at 12:36

You can use sapply (or some other vectorization method), like so:

a = letters[1:5] # your mer1$author_name1
b = LETTERS[1:5] # your mer1$author_name
name_dist = sapply(a, stringdist, b, method="lv")
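
One caveat worth checking before using this on the real data: in the call above, `sapply` hands each element of `a` to `stringdist` together with the whole vector `b`, so the result should be a matrix of all pairs rather than one distance per row. The sketch below, which assumes the stringdist package is installed, compares that matrix with a direct element-wise call:

```r
library(stringdist)

a <- letters[1:5]
b <- LETTERS[1:5]

# Each a[i] is compared against every element of b: a 5 x 5 matrix
all_pairs <- sapply(a, stringdist, b, method = "lv")

# Element-wise comparison: a[i] against b[i] only, a vector of length 5
pairwise <- stringdist(a, b, method = "lv")

# The row-by-row distances are the diagonal of the all-pairs matrix
same <- isTRUE(all.equal(unname(diag(all_pairs)), pairwise))
```

For the row-by-row distances asked for in the question, the element-wise call is the closer match and avoids computing the off-diagonal pairs.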

djas

Karsten W. · Answer 3 · 2014-03-26T14:25:51.973

0

Try

res <- transform(mer1, 
    name_dist=stringdist(author_name1,author_name,method="lv"),
    aff_dist=stringdist(author_affiliation1,author_affiliation,method="lv")
)

Since stringdist accepts vector input, a single call on whole columns should be more efficient than calling it once per row.
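
To check the efficiency claim on your own machine, a rough timing sketch along these lines may help. It assumes the stringdist package is installed. The random strings and the helper `rand_str` are made up for illustration, and timings will vary between systems:

```r
library(stringdist)

set.seed(1)
n <- 2000

# Hypothetical helper: one random 10-letter string (the argument is ignored)
rand_str <- function(k) paste(sample(letters, 10, replace = TRUE), collapse = "")
x <- vapply(seq_len(n), rand_str, character(1))
y <- vapply(seq_len(n), rand_str, character(1))

# Row-by-row loop with a preallocated result vector
loop_time <- system.time({
  d_loop <- numeric(n)
  for (i in seq_len(n)) d_loop[i] <- stringdist(x[i], y[i], method = "lv")
})

# Single vectorized call on the whole vectors
vec_time <- system.time(
  d_vec <- stringdist(x, y, method = "lv")
)

print(loop_time)
print(vec_time)
```

Both approaches should produce the same distances, so any difference is in running time only.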

edited Mar 26 '14 at 14:25

answered Mar 24 '14 at 12:22
