Merge, cbind: How to merge better?

Question

I want to merge multiple vectors to a data frame. There are two variables, city and id that are going to be used for matching vectors to data frame.

df <- data.frame(array(NA, dim =c(10*50, 2)))
names(df)<-c("city", "id")
df[,1]<-rep(1:50, each=10)
df[,2]<-rep(1:10, 50)

I created a data frame like this. To this data frame, I want to merge 50 vectors that each corresponds to 50 cities. The problem is that each city only has 6 obs. Each city will have 4 NAs.

To give you an example, city 1 data looks like this:

seed(1234) 
cbind(city=1,id=sample(1:10,6),obs=rnorm(6))

I have 50 cities data and I want to merge them to one column in df. I have tried the following code:

for(i in 1:50){
citydata<-cbind(city=i,id=sample(1:10,6),obs=rnorm(6))   # each city data
df<-merge(df,citydata, by=c("city", "id"), all=TRUE)}    # merge to df

But if I run this, the loop will show warnings like this:

In merge.data.frame(df, citydata, by = c("city", "id"),  ... :
  column names ‘obs.x’, ‘obs.y’ are duplicated in the result

and it will create 50 columns, instead of one long column.

How can I merge cbind(city=i,id=sample(1:10,6),obs=rnorm(6)) to df in a one nice and long column? It seems both cbind and merge are not ways to go.

In case there are 50 citydata (each has 6 rows), I can rbind them as one long data and use data.table approach or expand.gird+merge approach as Philip and Jaap suggested.

I wonder if I can merge each citydata through a loop one by one, instead of rbind them and merge it to df.

score 4 · Accepted Answer · answered Mar 06 '16 at 18:16

data.table is good for this:

library(data.table)
df <- data.table(df)
> df
     city id
  1:    1  1
  2:    1  2
  3:    1  3
  4:    1  4
  5:    1  5
 ---
496:   50  6
497:   50  7
498:   50  8
499:   50  9
500:   50 10

I'm using CJ instead of your for loop to make some dummy data. CJ cross-joins each column against each value of each other column, so it makes a two-column table with each possible pair of values of city and id. The [,obs:=rnorm(.N)] command adds a third column that draws random values (without recycling them as it would if it were inside the CJ)--.N means "# rows of this table" in this context.

citydata <- CJ(city=1:50,id=1:6)[,obs:=rnorm(.N)]
> citydata
     city id         obs
  1:    1  1  0.19168335
  2:    1  2  0.35753229
  3:    1  3  1.35707865
  4:    1  4  1.91871907
  5:    1  5 -0.56961647
 ---
296:   50  2  0.30592659
297:   50  3 -0.44989646
298:   50  4  0.05359738
299:   50  5 -0.57494269
300:   50  6  0.09565473

setkey(df,city,id)
setkey(citydata,city,id)

As these two tables have the same key columns the following looks up rows of df by the key columns in citydata, then defines obs in df by looking up the value in citydata. Therefore the resulting object is the original df but with obs defined wherever it was defined in citydata:

df[citydata,obs:=i.obs]
> df
     city id         obs
  1:    1  1  0.19168335
  2:    1  2  0.35753229
  3:    1  3  1.35707865
  4:    1  4  1.91871907
  5:    1  5 -0.56961647
 ---
496:   50  6  0.09565473
497:   50  7          NA
498:   50  8          NA
499:   50  9          NA
500:   50 10          NA

or evrything in one go: `setDT(df)[CJ(city=1:50,id=1:6)[,obs:=rnorm(.N)], obs:=i.obs, on = c("city","id")][]` — Jaap, Mar 06 '16 at 18:32
Nice one. (However I'm also guessing OP has pre-existing real city data, so the creation of the join-table is more for showing off that the merge works rather than something to be a part of the actual solution workflow.) — Philip, Mar 06 '16 at 18:36
Thanx. Looking at the question, I think it is highly probably the `citydata` are the already existing data and the `df` has to be constructed with a `CJ` in order to get the desired result. — Jaap, Mar 06 '16 at 19:38

Jaap · Answer 2 · 2016-03-06T19:53:43.907

In base R you can do this with a combination of expand.grid and merge:

citydata <- expand.grid(city=1:50,id=1:6)
citydata$obs <- rnorm(nrow(citydata))

res <- merge(df, citydata, by = c("city","id"), all.x = TRUE)

which gives:

> head(res,12)
    city id        obs
 1:    1  1 -0.3121133
 2:    1  2 -1.3554576
 3:    1  3 -0.9056468
 4:    1  4 -0.6511869
 5:    1  5 -1.0447499
 6:    1  6  1.5939187
 7:    1  7         NA
 8:    1  8         NA
 9:    1  9         NA
10:    1 10         NA
11:    2  1  0.5423479
12:    2  2 -2.3663335

A similar approach with dplyr and tidyr:

library(dplyr)
library(tidyr)

res <- crossing(city=1:50,id=1:6) %>% 
  mutate(obs = rnorm(n())) %>%
  right_join(., df, by = c("city","id"))

which gives:

> res
Source: local data frame [500 x 3]

    city    id        obs
   (int) (int)      (dbl)
1      1     1 -0.5335660
2      1     2  1.0582001
3      1     3 -1.3888310
4      1     4  1.8519262
5      1     5 -0.9971686
6      1     6  1.3508046
7      1     7         NA
8      1     8         NA
9      1     9         NA
10     1    10         NA
..   ...   ...        ...

Merge, cbind: How to merge better?

2 Answers2