0

Apologies if a similar question has been asked before, I was unable to find one, possibly because of wording of the question.

Some current sample data looks like this, where the first column is a list of identifiers (genes) and the second column is a set of descriptors (gene ontology IDs):

Gene    Gene_Ontology_ID
Gene1   GO1, GO2, GO4, GO6
Gene2   GO2, GO3, GO4
Gene3   GO5, GO7

I am wondering if there is an efficient way to transform a large table formatted similarly so that the "Gene_Ontology_ID" column now serves as the indentifying column, and the "Gene" column is now a list of genes with that Gene_Ontology_ID, like this:

Gene_Ontology_ID    Gene
GO1                 Gene1
GO2                 Gene1,Gene2
GO3                 Gene2
GO4                 Gene1,Gene2
GO5                 Gene3
GO6                 Gene1
GO7                 Gene3

Is there a solution to this, preferably using Unix, Python, or R? Any help would be greatly appreciated, thank you.

cteno4
  • 3
  • 1

4 Answers4

1
library(dplyr)
library(tidyr)
out <- df %>% separate(Gene_Ontology_ID, into=paste("genes", 1:7, sep = "_"),sep =", ",fill="right") %>% 
  gather(key,Gene_Ontology_ID, -Gene,na.rm=TRUE) %>%
  arrange(Gene_Ontology_ID,Gene) %>%
  group_by(Gene_Ontology_ID) %>% 
  summarise(Gene= paste(Gene,collapse =", "))

out
# A tibble: 7 x 2
  Gene_Ontology_ID         Gene
             <chr>        <chr>
1              GO1        Gene1
2              GO2 Gene1, Gene2
3              GO3        Gene2
4              GO4 Gene1, Gene2
5              GO5        Gene3
6              GO6        Gene1
7              GO7        Gene3
Jason
  • 581
  • 3
  • 8
0

A similar option with strsplit and unnest:

library(dplyr)
library(tidyr)

df %>% 
  mutate(Gene_Ontology_ID = strsplit(Gene_Ontology_ID, ", ")) %>%
  unnest(Gene_Ontology_ID) %>% 
  group_by(Gene_Ontology_ID) %>% 
  summarise(Gene = paste(Gene, collapse = ", "))

# A tibble: 7 x 2
  Gene_Ontology_ID Gene        
  <chr>            <chr>       
1 GO1              Gene1       
2 GO2              Gene1, Gene2
3 GO3              Gene2       
4 GO4              Gene1, Gene2
5 GO5              Gene3       
6 GO6              Gene1       
7 GO7              Gene3      
tyluRp
  • 4,678
  • 2
  • 17
  • 36
0

Using data.table and cSplit from splitstackshape:-

library(data.table)
library(splitstackshape)
df <- data.frame(Gene = c("Gene1", "Gene2", "Gene3"), Gene_Ontology_ID = c("GO1, GO2, GO4, GO6", "GO2, GO3, GO4", "GO5, GO7"))
df <- cSplit(df, 'Gene_Ontology_ID', ',', 'long', drop = FALSE)
setDT(df)
df[, Gene := as.character(Gene)]
df[, Gene := paste0(Gene, collapse = ", "), by = Gene_Ontology_ID]
setcolorder(df, c("Gene_Ontology_ID", "Gene"))
df <- unique(df)

You'll get:-

 Gene_Ontology_ID         Gene
1:              GO1        Gene1
2:              GO2 Gene1, Gene2
3:              GO4 Gene1, Gene2
4:              GO6        Gene1
5:              GO3        Gene2
6:              GO5        Gene3
7:              GO7        Gene3
sm925
  • 2,648
  • 1
  • 16
  • 28
0

Here is a solution using only base R (probably not the most efficient):

# Obtain a vector of unique "gene ontology ids"
all_genes_id <- paste0(df$Gene_Ontology_ID, collapse = ", ")
all_genes_id <- unique(strsplit(all_genes_id, ", ")[[1]])

# Initalize and fill vector of genes per each "gene ontology ids"
genes_per_id <- vector(mode = "character", length(all_genes_id))
for(i in 1:length(all_genes_id)) {
  rows_df <- grepl(all_genes_id[i], df$Gene_Ontology_ID)
  genes_per_id[i] <- paste0(df$Gene[rows_df], collapse = ",")
}

# New data frame
df2 <- data.frame(Gene_Ontology_ID = all_genes_id,
                  Gene = genes_per_id)
df2
# Result
  Gene_Ontology_ID        Gene
1              GO1       Gene1
2              GO2 Gene1,Gene2
3              GO4 Gene1,Gene2
4              GO6       Gene1
5              GO3       Gene2
6              GO5       Gene3
7              GO7       Gene3
s_baldur
  • 29,441
  • 4
  • 36
  • 69