Why do results differ for dplyr left_join() and right_join() using these two dataframes

Question

I am learning how to use the R dplyr 'join' functions by doing the exercises from this course: https://github.com/uclouvain-cbio/WSBIM1207 and got stuck on the problem described below.

First, download the example dataframes used for this question:

BiocManager::install("UCLouvain-CBIO/rWSBIM1207")

Load the package:

library(rWSBIM1207)

Then in R/RStudio load the dataframe files, 'clinical2' and 'expression' by typing:

data(clinical2)
data(expression)

The task is, firstly: 'Join the expression and clinical2 tables by the patient reference, using the left_join and the right_join functions.' I did that in this way:

left_join(expression, clinical2, 
          by = c("patient" = "patientID"))
right_join(expression, clinical2,
                     by = c("patient" = "patientID"))

The second task is to explain why the results are different. I found that there are 3 more rows in the right_join output versus the left_join output. This seems odd to me given that 'clinical2' has 516 rows, whereas 'expression' has 570 rows. The 3 extra rows present in the r_join output have in common that they contain multiple NA values, which presumably represent patients found in 'clinical2' and not in 'expression'. I don't really understand what is going on here, and would be grateful for any help.

The online help has a discussion of the differences between the two joins. What part of that discussion is unclear? — Limey, Dec 30 '21 at 20:56

TarJae · Accepted Answer · 2021-12-31T11:12:23.010

Update Maybe not clearly explained. Here an explanation with the original data:

create left join

left <- left_join(expression, clinical2, by = c("patient" = "patientID"))

dim(expression)
[1] 570   8
dim(left)
[1] 570  10

create right join

right <- right_join(expression, clinical2,
           by = c("patient" = "patientID"))

dim(expression)
[1] 570   8
dim(right)
[1] 573  10

You want to know why dim(right) is 573!

Explanation step by step:

Definition of right_join(): includes all rows in y (y is here clinical2).
By doing so: there are 3 rows in clinical2 (y) which are not in expression (x)

See here: which patient occur in clinical2 but not in expression

anti_join(clinical2, expression, by=c("patientID"="patient"))

patientID    gender years_at_diagnosis
<chr>        <chr>               <dbl>
1 TCGA-55-7284 male                 74.2
2 TCGA-55-7913 female               61.2
3 TCGA-67-4679 male                 69.0

Again:

right_join(expression, clinical2, by = c("patient" = "patientID"))

We start with expression(x) (dim = 570 8) and join with clinical2 (y) (dim = 516 3)

So now what happens:

All 570 from expression (x) are matched with all in clinial2 (y) (dim = 516 3) EXCEPT these 3 patientID in clinical2 TCGA-55-7284, TCGA-55-7913, TCGA-67-4679
right_join now takes all 570 from expression and adds the 3 not matching patientID from clinical2 resulting in a dim of 573 10
In contrast left_join: left_join(): includes all rows in x (=expression), So if we do

  anti_join(expression, clinical2, by=c("patient"="patientID"))

We get:
  # ... with 8 variables: sampleID <fct>, patient <chr>, type <chr>, A1BG <dbl>, A1CF <dbl>,
  #   A2BP1 <dbl>, A2LD1 <dbl>, A2ML1 <dbl>````

That means all rows are included in expression. So no additional rows will be added here:

First answer: In joining two things are important:

from which side you start to join, e.g. which table is in first position
Given position of tables e.g. df1, df2 which join method you apply

See this example:

library(dplyr)
library(tibble)

# add ID
iris1<- iris %>% 
  tibble::rowid_to_column("ID")

# add ID
mtcars1 <- mtcars %>% 
  tibble::rowid_to_column("ID")

dim(iris1)
# [1] 150   6

dim(mtcars1)
# [1] 32 12

# 1. iris1 is first and we start from left e.g. iris1

a <- left_join(iris1, mtcars1, by="ID")
dim(a)
# [1] 150  17

# 2. iris1 is still first, but we join from right e.g. mtcars1
b <- right_join(iris1, mtcars1, by="ID")
dim(b)
# [1] 32 17

# 3. mtcars1 is first and we join from left e.g mtcars1
a1 <- left_join(mtcars1, iris1, by="ID")
dim(a1)
# [1] 32 17

-> b = a1 e.g. right_join(iris1, mtcars1, by="ID") = left_join(mtcars1, iris1, by="ID")

https://www.rdocumentation.org/packages/dplyr/versions/0.7.8/topics/join

The dim(a) and dim(b) values you provided make perfect sense. In the dataframe examples I provided, dim(clinical2) is 516 by 3 and dim(expression) is 570 by 8. dim(left_join) is 570 by 10, which makes sense. However, dim(right_join) is 573 by 10. Why is that? I am looking for an explanation for the output of right_join in my specific example, which has not been answered in the earlier posts. — JohnJ, Dec 31 '21 at 09:41
Please see my update. The simple answer is that 3 rows from clinical2 that are not included in expression are added to the new dataframe, because with `right_join` in this constellation we start with `expression` that has 570 rows. 570 from expression + 3 new ones from clinical2 gives 573. — TarJae, Dec 31 '21 at 11:14
TarJae this is now a really clear, excellent explanation, and has revealed a new dplyr term `anti-join` that I can see the usefulness of. One of the things which was kind of obvious that I initially overlooked is that in the 'expression' dataset the 'patient' column has many duplicate IDs: `> length(unique(expression$patient)) [1] 513` not 570. The discrepancy between the number of rows in 'expression' and 'clinical2' is thus easily explained! — JohnJ, Jan 01 '22 at 12:57