Applying an alternative to a for loop in R

Question

I am looking for an efficient alternative to a for loop in R,

where data_papers is

data_papers <- c(1, 3, 47276, 77012, 77012, 79468, ....)

paper_author:

   paper_id author_id
1        1    521630
2        1    972575
3        1   1528710
4        1   1611750
5        2   1682088

I need to find the authors listed in paper_author for each paper in data_papers. data_papers holds around 350,000 papers, against around 2,100,000 papers in paper_author.

So my output would be a list of author_id vectors, one per paper_id in data_papers:

authors:
 [[1]]
 [1]     521630   972575  1528710  1611750

 [[2]]
 [1]     826   338038  788465 1256860 1671245 2164912

 [[3]]
 [1]     366653 1570981 1603466

The simplest way to do this would be

authors <- vector("list", length(data_papers))
for (i in seq_along(data_papers)) {
  # author ids of every row whose paper_id matches the i-th requested paper
  authors[[i]] <- paper_author$author_id[paper_author$paper_id == data_papers[i]]
}

But the computation time is very high.

The other alternative is something like the code below, taken from efficient programming in R:

i=1:length(data_papers)
authors[i]<-as.data.frame(paper_author$author_id[which(paper_author$paper_id%in%data_papers[i])])

But I cannot get this to work.

How could this be done? Thanks.

score 2 · Accepted Answer · answered Apr 09 '14 at 10:13

with(paper_author, split(author_id,paper_id))
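Note that the one-liner above groups every paper in paper_author, not just those in data_papers. A minimal sketch of one way to restrict the result, assuming toy data in place of the real tables (the rows for paper 3 are made up for illustration): split() names each group by its paper_id, so indexing by as.character(data_papers) picks out the requested papers, repeats included.

```r
# Toy stand-ins for the real data (values for paper 3 are hypothetical)
paper_author <- data.frame(
  paper_id  = c(1, 1, 1, 1, 2, 3, 3),
  author_id = c(521630, 972575, 1528710, 1611750, 1682088, 366653, 1570981)
)
data_papers <- c(1, 3)

# Group author ids by paper in a single pass over paper_author
by_paper <- with(paper_author, split(author_id, paper_id))

# Keep only the requested papers; the group names are paper ids as strings
authors <- unname(by_paper[as.character(data_papers)])
```

A paper id that never appears in paper_author comes back as a NULL element, so it may be worth checking for those afterwards.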

Miff

score 1 · Answer 2 · answered Apr 09 '14 at 09:25

Or you could use R's merge function?

merge(data_papers, paper_author, by=1)
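merge() returns a long data frame with one row per (paper, author) pair rather than a list. A rough sketch of how that might be turned into the list asked for, assuming toy data and a hypothetical helper data frame named wanted (neither is from the question):

```r
# Toy stand-ins for the real data (hypothetical values)
paper_author <- data.frame(
  paper_id  = c(1, 1, 2, 3),
  author_id = c(521630, 972575, 1682088, 366653)
)
data_papers <- c(1, 3)

# Wrap the vector in a data frame so the join column has an explicit name;
# unique() avoids duplicated rows when a paper id is listed twice
wanted <- data.frame(paper_id = unique(data_papers))

# Inner join: keeps only the rows of paper_author whose paper_id is wanted
joined <- merge(wanted, paper_author, by = "paper_id")

# Optional: convert the long table into a list of author ids per paper
authors <- split(joined$author_id, joined$paper_id)
```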

Gavin Kelly

score 0 · Answer 3 · answered Apr 09 '14 at 07:48

Why are you not able to use this second solution you mentioned? Information on why would be useful.

In any case, what you want to do is join two tables (data_papers and paper_author). Doing it with nested loops, as your sample code does (whether as R for loops or as the C for loops underlying vector operations), is pretty inefficient. You could build some kind of index data structure, based on e.g. the hash package, but it's a lot of work.
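For illustration only, here is a rough sketch of the index idea using a base R environment as the hash table instead of the hash package (that substitution, the toy data, and the names index and found are assumptions, not part of the question). The index is built once from paper_author and then queried by paper id.

```r
# Toy stand-ins for the real data (hypothetical values)
paper_author <- data.frame(
  paper_id  = c(1, 1, 2, 3),
  author_id = c(521630, 972575, 1682088, 366653)
)
data_papers <- c(1, 99)  # 99 is deliberately absent from paper_author

# Build the index once: one entry per paper id, holding its author ids
by_paper <- split(paper_author$author_id, paper_author$paper_id)
index <- list2env(by_paper, envir = new.env(hash = TRUE))

# Look up all requested papers; missing ids yield NULL instead of an error
found <- mget(as.character(data_papers), envir = index,
              ifnotfound = list(NULL))
```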

Instead, just use a database. They're built for this sort of thing. sqldf even lets you embed one into R.

install.packages("sqldf")
library(sqldf)

# sqldf queries data frames, so wrap the vector of paper ids in one
dp <- data.frame(paper_id = data_papers)

# you probably want to dig into the indexing options available here as well
combined <- sqldf("select distinct pa.author_id from paper_author pa inner join dp on dp.paper_id = pa.paper_id where dp.paper_id = 1234;")
