1

I have a data frame with many rows (>9000) and columns (148). The first column has a unique code for experiments, the other columns are populated with the name of the clones tested in the experiment. I want a matrix that has the number of common clones among each experiment (pairwise).

Example of my dataset:

Exp_No    Clone1    Clone2   Clone3    Clone4
Exp1      Egxn2     Egxn11   Egxn6     Egxn13
Exp2      Egxn4     Egxn13   Egxn16    Egxn6
Exp3      Egxn2     Egxn6    Egxn11    Egxn18
Exp4      Egxn6     Egxn14   Egxn4     Egxn18
Exp5      Egxn2     Egxn11   Egxn6     Egxn13
Exp6      Egxn4     Egxn2    Egxn5     Egxn18

What I need:

Exp1  Exp2  2
Exp1  Exp3  3
Exp1  Exp4  1    
Exp1  Exp5  4
Exp1  Exp6  1
Exp2  Exp3  1
Exp2  Exp4  2
...

and so on for all pairs of rows. Any suggestion? Thank you in advance, been on this for a few hours! I could not find a way to resolve this.

Darren Tsai
  • 32,117
  • 5
  • 21
  • 51
Ilaria
  • 11
  • 2

2 Answers2

1

You can stack all of the Clone columns and then merge it to itself by the names of the clones.

Since dplyr 1.1.0
library(dplyr)

df_long <- df %>%
  tidyr::pivot_longer(contains('Clone'), names_to = NULL)

df_long %>%
  inner_join(df_long, by = join_by(value, y$Exp_No > x$Exp_No)) %>%
  count(Exp_No.x, Exp_No.y)

# # A tibble: 15 × 3
#    Exp_No.x Exp_No.y     n
#    <chr>    <chr>    <int>
#  1 Exp1     Exp2         2
#  2 Exp1     Exp3         3
#  3 Exp1     Exp4         1
#  4 Exp1     Exp5         4
#  5 Exp1     Exp6         1
#  6 Exp2     Exp3         1
#  7 Exp2     Exp4         2
#  8 Exp2     Exp5         2
#  9 Exp2     Exp6         1
# 10 Exp3     Exp4         2
# 11 Exp3     Exp5         3
# 12 Exp3     Exp6         2
# 13 Exp4     Exp5         1
# 14 Exp4     Exp6         2
# 15 Exp5     Exp6         1

dplyr 1.0.0 or older
df_long %>%
  inner_join(df_long, by = "value") %>%
  filter(Exp_No.y > Exp_No.x) %>%
  count(Exp_No.x, Exp_No.y)

Data
df <- read.table(text = "
Exp_No    Clone1    Clone2   Clone3    Clone4
Exp1      Egxn2     Egxn11   Egxn6     Egxn13
Exp2      Egxn4     Egxn13   Egxn16    Egxn6
Exp3      Egxn2     Egxn6    Egxn11    Egxn18
Exp4      Egxn6     Egxn14   Egxn4     Egxn18
Exp5      Egxn2     Egxn11   Egxn6     Egxn13
Exp6      Egxn4     Egxn2    Egxn5     Egxn18", header = TRUE)
Darren Tsai
  • 32,117
  • 5
  • 21
  • 51
  • Thank you Darren, it works! I just need to split the file as it runs out of memory, but it work! Thanks a limmion! – Ilaria Feb 04 '23 at 15:51
1

Update based on comment:

tcrossprod on a sparse matrix will give good performance. Demonstrating with 131 rows and 165 columns, with NAs present:

library(Matrix) # for sparse matrices
library(data.table) # final solution will be stored as a data.table

m <- as(
  triu( # get the upper triangle of the symmetric matrix
    tcrossprod( # tcrossprod to get the pairwise common clone counts
      # convert the data.frame to a sparse matrix with nrow(df) rows and
      # length(unique(unlist(df[,-1]))) columns (the number of unique clones
      # in the dataset)
      sparseMatrix(
        rep(1:131, 165)[i <- which(!is.na(cl <- unlist(df[,-1], 0, 0)))],
        as.numeric(gsub("Egxn", "", cl[i])),
        x = 1L
      )
    ), k = 1 # don't keep the diagonal (comparing rows with themselves)
  ), "TsparseMatrix" # set the result as a triangular matrix
)

# build the final answer
dtPairs <- setorder(
  data.table(
    # the Exp1 and Exp2 columns are row indices from df
    # sparse matrix indices are zero-based, so add one
    Exp1 = attr(m, "i") + 1L,
    Exp2 = attr(m, "j") + 1L,
    Common = attr(m, "x")
  ), Exp1, Exp2 # sort by Exp1 then by Exp2
)

dtPairs[1:10,]
#>     Exp1 Exp2 Common
#>  1:    1    2     18
#>  2:    1    3     26
#>  3:    1    4     11
#>  4:    1    5     13
#>  5:    1    6     25
#>  6:    1    7      6
#>  7:    1    8     10
#>  8:    1    9     17
#>  9:    1   10     23
#> 10:    1   11     13

nrow(dtPairs)
#> [1] 8515

Data:

df <- cbind(
  data.frame(Exp_No = paste0("Exp", 1:131)),
  matrix(
    replicate(131, c(sample(paste0("Egxn", 1:1e3), cl <- sample(100:165, 1)), rep(NA_character_, 165L - cl))),
    131, 165, 1, list(NULL, paste0("Clone", 1:165))
  )
)
jblood94
  • 10,340
  • 1
  • 10
  • 15
  • Hi jblood94 and @Darren Tsai, when used on the whole dataset, I cannot get the codes to work. I Have eliminated some records that I can do without and my dataset is now made by 131 experiments where a varying number of clones is planted (max 165). Because of the number of clones planted at each experiment is not constant, some of the rows will have NAs. Is there a way for me to upload the csv so that you can run your script? I am not sure if the NAs are causing the script to count the number of common clones incorrectly..Thanks again for the help, much appreciated! – Ilaria Feb 06 '23 at 13:24
  • You just need to remove the `NA`s when building the sparse matrix. See the update. – jblood94 Feb 06 '23 at 14:07
  • Hi jblood94, thank you for helping again! Will try to run the script again. Ilaria – Ilaria Feb 15 '23 at 06:56