7

Suppose I have two dataframes:

df1 <- data.frame(matrix(rnorm(10*10),ncol=10))
df2 <- data.frame(matrix(rnorm(10*10),ncol=10))
colnames(df1) <- 1:10
colnames(df2) <- 11:20

How do I use merge to cbind these (I already know about cbind but I am interested in the application of merge here).

Roman Luštrik
  • 69,533
  • 24
  • 154
  • 197
user2763361
  • 3,789
  • 11
  • 45
  • 81

2 Answers2

7

There's nothing here on which to merge other than row names.

merge(df1, df2, by=0)

Note the order of the rows in the result! These are lexicaly sorted by row name. To get the proper order, use order:

x <- merge(df1, df2, by=0)
x[order(as.numeric(x$Row.names)),]
Matthew Lundberg
  • 42,009
  • 6
  • 90
  • 112
  • Nice. Why does the ordering get messed up though? Any way to preserve row ordering as if it was a `cbind`? – user2763361 Dec 16 '13 at 03:27
  • I assume that the order is the lexical order of the tokens. 1 < 10 < 2 < 3... The order can be fixed simply by sorting on the new 'name' column. I'll add that to the answer. – Matthew Lundberg Dec 16 '13 at 03:29
  • Why can't we merge off rownames? – user2763361 Dec 16 '13 at 03:30
  • cbind() isn't the same as merge()! Cbind concatenates columns for example: m <- cbind(1, 1:7). To use merge() you need something that both columns habe in common and "join" them using that key. There is a more intuitive approach if you know sql using the plyr package, join() function. – marbel Dec 16 '13 at 03:31
  • You can merge by row names, actually. I keep forgetting about that. The order is still a problem. – Matthew Lundberg Dec 16 '13 at 03:32
7

I have made the matrices smaller for display purposes.

> df1 <- data.frame(matrix(rnorm(15),ncol=3))
> df2 <- data.frame(matrix(rnorm(15),ncol=3))
> colnames(df1) <- paste0("A", 1:3)
> colnames(df2) <- paste0("B", 4:6)

We have data frames df1 and df2, with columns names A1, A2 & A3 and B1, B2 & B3 respectively.

So,as you know, cbind() just sticks the data frames together, side by side.

> cbind(df1, df2)
         A1        A2       A3        B4       B5        B6
1  2.055780  0.362796  1.25536 -1.748416  0.41855 -0.516635
2  0.010779  0.086778 -0.68413  1.183762 -1.20362  0.041147
3 -0.732393  0.235125 -0.89306  1.435362 -0.26066 -0.025933
4 -2.493843 -2.654263  0.36107  0.083018 -0.82251 -0.991135
5  0.935540  0.398196 -0.43043  0.470559 -0.54146  1.955555

merge() looks for common columns. In this case there are none, so it produces essentially an outer product in which each row of df1 is matched against each row of df2.

> merge(df1, df2)
          A1        A2       A3        B4       B5        B6
1   2.055780  0.362796  1.25536 -1.748416  0.41855 -0.516635
2   0.010779  0.086778 -0.68413 -1.748416  0.41855 -0.516635
3  -0.732393  0.235125 -0.89306 -1.748416  0.41855 -0.516635
4  -2.493843 -2.654263  0.36107 -1.748416  0.41855 -0.516635
5   0.935540  0.398196 -0.43043 -1.748416  0.41855 -0.516635
6   2.055780  0.362796  1.25536  1.183762 -1.20362  0.041147
7   0.010779  0.086778 -0.68413  1.183762 -1.20362  0.041147
8  -0.732393  0.235125 -0.89306  1.183762 -1.20362  0.041147
9  -2.493843 -2.654263  0.36107  1.183762 -1.20362  0.041147
10  0.935540  0.398196 -0.43043  1.183762 -1.20362  0.041147
11  2.055780  0.362796  1.25536  1.435362 -0.26066 -0.025933
12  0.010779  0.086778 -0.68413  1.435362 -0.26066 -0.025933
13 -0.732393  0.235125 -0.89306  1.435362 -0.26066 -0.025933
14 -2.493843 -2.654263  0.36107  1.435362 -0.26066 -0.025933
15  0.935540  0.398196 -0.43043  1.435362 -0.26066 -0.025933
16  2.055780  0.362796  1.25536  0.083018 -0.82251 -0.991135
17  0.010779  0.086778 -0.68413  0.083018 -0.82251 -0.991135
18 -0.732393  0.235125 -0.89306  0.083018 -0.82251 -0.991135
19 -2.493843 -2.654263  0.36107  0.083018 -0.82251 -0.991135
20  0.935540  0.398196 -0.43043  0.083018 -0.82251 -0.991135
21  2.055780  0.362796  1.25536  0.470559 -0.54146  1.955555
22  0.010779  0.086778 -0.68413  0.470559 -0.54146  1.955555
23 -0.732393  0.235125 -0.89306  0.470559 -0.54146  1.955555
24 -2.493843 -2.654263  0.36107  0.470559 -0.54146  1.955555
25  0.935540  0.398196 -0.43043  0.470559 -0.54146  1.955555

If we rename the first column in df1 so that it matches the name of the first column in df2 then merge() looks for common values in those two columns. Since there are no common values, the output is empty.

> colnames(df1)[1] = "B4"
> merge(df1, df2)
[1] B4 A2 A3 B5 B6
<0 rows> (or 0-length row.names)

But now if we copy (and reverse, just to make things interesting!) the first column of df2 into the first column of df1...

> df1$B4 = rev(df2$B4)
> df1
        B4       A2        A3
1 -0.50647 -0.48128  0.540799
2 -0.70684 -0.35401  0.872514
3  0.14341  1.12184 -0.079913
4 -0.59989  0.81912  1.726494
5  0.33864  0.85277  0.386702
> df2
        B4       B5        B6
1  0.33864  1.83677  0.406717
2 -0.59989 -0.43630  0.075029
3  0.14341  1.01496  0.095534
4 -0.70684  1.32414 -0.122613
5 -0.50647  0.70709 -0.700225

... and try to merge again...

> merge(df1, df2)
        B4       A2        A3       B5        B6
1 -0.70684 -0.35401  0.872514  1.32414 -0.122613
2 -0.59989  0.81912  1.726494 -0.43630  0.075029
3 -0.50647 -0.48128  0.540799  0.70709 -0.700225
4  0.14341  1.12184 -0.079913  1.01496  0.095534
5  0.33864  0.85277  0.386702  1.83677  0.406717

... we finally get something meaningful: the rows of df1 and df2 are stuck together according to values in df1$B4 matching values in df2$B4.

I hope that this helps.

datawookie
  • 1,607
  • 12
  • 20