I want to collapse the rows of a data frame to create a group for each ortholog and its corresponding genes.

For example:

Column A Column B
Ortho1 gene1
Ortho2 gene2, gene3
Ortho3 gene4, gene5, gene6
Ortho4 gene5, gene6
Ortho5 gene6, gene7
Ortho6 gene1, gene8

to become:

Column A Column B
Ortho1, Ortho6 gene1, gene8
Ortho2 gene2, gene3
Ortho3, Ortho4, Ortho5 gene4, gene5, gene6, gene7

I have tried to merge them; however, merging requires an id column, which my data does not have. I also tried a for loop with intersect(). It feels like there should be a simpler way to overcome this bottleneck.

  • The original data looked like this:
Column A Column B
Ortho1 gene1
Ortho2 gene2
Ortho2 gene3

...

zx8754
Jin_soo
  • Is the only rule for defining an ortholog group: They have one or more genes in common? – Seth May 04 '23 at 14:48
  • @Seth One ortholog can have multiple genes, and one gene can contribute to multiple orthologs, so I want to cluster them as one group. Does this explanation help? – Jin_soo May 04 '23 at 14:51

1 Answer

The data is similar to a graph with nodes and edges connecting them. One solution would be to use the igraph package to take care of finding the non-overlapping groups. You can do

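library(igraph)  # also provides the %>% pipe used below

dd %>% 
  # one row per ortholog-gene pair; sep is set explicitly so IDs are only split at commas
  tidyr::separate_rows(`Column B`, sep = ",\\s*") %>% 
  # orthologs and genes become vertices, each ortholog-gene pair becomes an edge
  graph_from_data_frame(vertices=rbind(
    data.frame(v=unique(.$`Column A`), type="ortho"), 
    data.frame(v=unique(.$`Column B`), type="gene"))) %>% 
  # split the graph into its connected (non-overlapping) components
  decompose() %>% 
  # collapse each component back into a single row
  purrr::map_df(function(g) {
    data.frame(
      "Column A" = paste((V(g)$name[V(g)$type=="ortho"]), collapse = ","),
      "Column B" = paste((V(g)$name[V(g)$type=="gene"]), collapse = ",")
    )
  })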

Which will return

              Column.A                Column.B
1        Ortho1,Ortho6             gene1,gene8
2               Ortho2             gene2,gene3
3 Ortho3,Ortho4,Ortho5 gene4,gene5,gene6,gene7
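
Since the comments ask for more explanation, here is a minimal sketch of the same idea unpacked into separate steps. It is not the code above but a variant: it uses `components()` instead of `decompose()`, and a base R `strsplit()` stands in for the `separate_rows()` step. The `dd` construction, the names `edges`, `memb`, `nm` and `collapse_by`, and the `", "` output separator are assumptions made for illustration. It also assumes that no ortholog ID is identical to a gene ID.

```r
library(igraph)

# Example data from the question, assumed to be stored in `dd`
dd <- data.frame(
  `Column A` = paste0("Ortho", 1:6),
  `Column B` = c("gene1", "gene2, gene3", "gene4, gene5, gene6",
                 "gene5, gene6", "gene6, gene7", "gene1, gene8"),
  check.names = FALSE,
  stringsAsFactors = FALSE
)

# 1. Edge list: one row per ortholog-gene pair
genes <- strsplit(dd$`Column B`, ",\\s*")
edges <- data.frame(
  ortho = rep(dd$`Column A`, lengths(genes)),
  gene  = unlist(genes),
  stringsAsFactors = FALSE
)

# 2. Graph in which every ortholog and every gene is a vertex
g <- graph_from_data_frame(edges, directed = FALSE)

# 3. Connected component id of each vertex, plus the vertex names
memb <- as.integer(components(g)$membership)
nm   <- V(g)$name

# 4. Paste together the names of the selected vertices, per component
collapse_by <- function(keep) {
  unname(vapply(split(nm[keep], memb[keep]), paste, character(1), collapse = ", "))
}

is_ortho <- nm %in% edges$ortho
result <- data.frame(
  orthologs = collapse_by(is_ortho),
  genes     = collapse_by(!is_ortho),
  stringsAsFactors = FALSE
)
print(result)
```

On the example data this should give the same three groups as the output above. Every component contains at least one ortholog and one gene, because every vertex comes from an ortholog-gene pair, so the two collapsed columns always line up.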
MrFlick
  • Thank you for the comment. However, after applying the code I only get two columns with one row, where all the orthologs and genes are collapsed. – Jin_soo May 04 '23 at 15:06
  • can you please explain the code a bit more? – Jin_soo May 04 '23 at 15:06
  • Is your test data different than your actual data? It sounds like in your real data all your values are connected to each other. It's not clear how you would create separate rows in that case. – MrFlick May 04 '23 at 15:10
  • The actual data has around 9,704 rows, more than the example, but the structure is the same. If I set `graph_from_data_frame` to undirected, would that group them better? – Jin_soo May 04 '23 at 15:18
  • It doesn't matter if it's directed or undirected when using `igraph::decompose`. They are still connected. – MrFlick May 04 '23 at 15:26
  • Passing the original data (one ortholog per gene row) to `graph_from_data_frame` and applying the function gives empty rows. – Jin_soo May 04 '23 at 15:30
  • Was there a reason you used `tidyr::separate_rows`? The function seems to create weird values. – Jin_soo May 04 '23 at 15:41
  • Specifying the `sep = ` option of `separate_rows` solved the problem of the weird values (it was splitting off the prefix of the values). However, the result still does not appear. The data frame looks empty. – Jin_soo May 04 '23 at 15:47
  • I made it... Your code was working. Found all my mistakes! – Jin_soo May 04 '23 at 15:55
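
The thread does not say what the mistakes were. One way to check whether real data really is fully connected, as suggested in the comments, is to count the connected components before collapsing anything. This is a minimal sketch under assumptions: the `long` data frame is a hypothetical long-format input (one ortholog-gene pair per row, like the original data in the question), and its column names are made up.

```r
library(igraph)

# Hypothetical long-format input: one ortholog-gene pair per row
long <- data.frame(
  ortho = c("Ortho1", "Ortho2", "Ortho2", "Ortho6", "Ortho6"),
  gene  = c("gene1", "gene2", "gene3", "gene1", "gene8"),
  stringsAsFactors = FALSE
)

# Long-format data is already an edge list, so no row separation is needed
g <- graph_from_data_frame(long, directed = FALSE)

comp <- components(g)
comp$no     # number of groups the data will collapse into
comp$csize  # number of vertices (orthologs plus genes) in each group
```

If `comp$no` is 1, every ortholog is linked to every other through some chain of shared genes, and a single collapsed row is the expected result.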