Basic Variable Lookup in R

Question

I have two files. One has a series of genes that I'm interested in. The other has genes and the pathways they are associated with. The first list looks like this:

Solyc08g062250
Solyc02g069270
Solyc07g064990
Solyc09g065800
Solyc02g077620
Solyc01g104400
Solyc02g065290
Solyc02g090220

The second list has these genes and the "pathways" they belong to (this is a sample; the full file is much larger and has many pathways and genes):

Solyc10g008120  1,3,5-trimethoxybenzene biosynthesis
Solyc02g069920  1,4-dihydroxy-2-naphthoate biosynthesis I
Solyc04g005180  1,4-dihydroxy-2-naphthoate biosynthesis I
Solyc04g005190  1,4-dihydroxy-2-naphthoate biosynthesis I
Solyc04g005200  1,4-dihydroxy-2-naphthoate biosynthesis I
Solyc05g005180  1,4-dihydroxy-2-naphthoate biosynthesis I
Solyc06g071030  1,4-dihydroxy-2-naphthoate biosynthesis I

The catch is that several of my genes fall into several pathways. I need a good way to take each gene ID from my input set and list all of the pathways it belongs to next to it.

I was originally trying to use the command

c<-b[b$GeneID %in% a$GeneIDs,]

where b was my pathway/GeneID and a was my list of Gene IDs that I wanted, but it only returns one pathway and I know a number of these genes fall into several pathways.

I'm entirely new to programming, so I've been having trouble with this. Any help would be appreciated! I don't know how to search the Internet for this because I don't know what it is called.

You can loop through the GeneIDs of the first one using `lapply` and compare each with `==`, i.e. `lapply(a$GeneIDs, function(x) b[b$GeneID ==x,])`. The output will be a list of `data.frames` (assuming that is what you wanted) — akrun, Sep 10 '15 at 02:46
This question is very similar to this one: http://stackoverflow.com/questions/30331830/r-find-frequencies-over-3rd-quartile-in-table you may want to give the accepted answer a try — PavoDive, Sep 10 '15 at 03:40
please keep in mind that `c` is itself a function in R, and assigning variables to `c` might cause odd behavior — PavoDive, Sep 10 '15 at 04:18
Did any of the answers work for you? If so, please consider marking it as "accepted" by clicking the tick mark below the question votes — PavoDive, Sep 10 '15 at 10:32
Gimme some time PavoDive, I just woke up right now. I'll give the answers a try as I read down the list and understand the commands. I'm new at this so I need to look up the commands and how they work. — kevluv93, Sep 10 '15 at 11:28

score 0 · Answer 1 · answered Sep 10 '15 at 03:02

Would something like this help, using some toy data:

allgenes <- c("a", "b", "c")
dat <- data.frame(gene = c(rep("a", 2), "b", rep("c", 2)), 
   path = paste("path", 1:5))
dat
 gene   path
1    a path 1
2    a path 2
3    b path 3
4    c path 4
5    c path 5

Create a data frame with commas separating each path

res <- lapply(allgenes, function(y) paste(dat$path[which(dat$gene %in% y)], collapse=", "))
data.frame(allgenes, paths=do.call(rbind, res))

  allgenes          paths
1        a path 1, path 2
2        b         path 3
3        c path 4, path 5
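A possible base R alternative, sketched on the same toy data: `aggregate` can collapse the paths per gene in one call after subsetting with `%in%`. The vector `wanted` is a hypothetical stand-in for the list of genes of interest; it is not part of the original answer.

```r
# Toy data from the answer above
dat <- data.frame(gene = c(rep("a", 2), "b", rep("c", 2)),
                  path = paste("path", 1:5),
                  stringsAsFactors = FALSE)

# Hypothetical subset of genes of interest (an assumption for illustration)
wanted <- c("a", "c")

# Keep every row whose gene is wanted, then collapse the paths per gene
hits <- dat[dat$gene %in% wanted, ]
res2 <- aggregate(path ~ gene, data = hits,
                  FUN = function(p) paste(p, collapse = ", "))
res2
```

Note that the `%in%` subset keeps every matching row, so a gene with several pathways contributes several rows before the collapse.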

score 0 · Answer 2 · answered Sep 10 '15 at 03:27

I'm thinking you want to do something like this.

library(dplyr)
library(magrittr)

gene.interested =
  tibble(gene = c(
    "Solyc10g008120",
    "Solyc02g069920",
    "Solyc08g062250"))

gene__pathway =
  tibble(
    gene = c(
      "Solyc10g008120",
      "Solyc02g069920",
      "Solyc02g069920",
      "Solyc04g005180"),
    pathway = c(
      "1,3,5-trimethoxybenzene biosynthesis",
      "1,4-dihydroxy-2-naphthoate biosynthesis I",
      "1,4-dihydroxy-2-naphthoate biosynthesis I",
      "1,4-dihydroxy-2-naphthoate biosynthesis I"))

result = 
  gene.interested %>%
  left_join(gene__pathway, by = "gene") %>%
  group_by(gene) %>%
  summarize(pathways = pathway %>% paste(collapse = "; "))
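Two things may be worth watching with a left join here: a gene of interest with no pathway entry (Solyc08g062250 in this sample) gets `NA`, which `paste` turns into the string "NA", and repeated gene/pathway rows show up twice in the collapsed string. Below is a minimal base R sketch of the same join-then-collapse idea that handles both cases. The helper name `collapse_pathways` and the object names are hypothetical, not part of the answer above.

```r
gene_interested <- data.frame(
  gene = c("Solyc10g008120", "Solyc02g069920", "Solyc08g062250"),
  stringsAsFactors = FALSE)

gene_pathway <- data.frame(
  gene = c("Solyc10g008120", "Solyc02g069920",
           "Solyc02g069920", "Solyc04g005180"),
  pathway = c("1,3,5-trimethoxybenzene biosynthesis",
              "1,4-dihydroxy-2-naphthoate biosynthesis I",
              "1,4-dihydroxy-2-naphthoate biosynthesis I",
              "1,4-dihydroxy-2-naphthoate biosynthesis I"),
  stringsAsFactors = FALSE)

# Hypothetical helper: join unique, non-missing pathways into one string
collapse_pathways <- function(p) {
  p <- unique(p[!is.na(p)])
  if (length(p) == 0) NA_character_ else paste(p, collapse = "; ")
}

# Left join: all.x = TRUE keeps genes that have no pathway entry
joined <- merge(gene_interested, gene_pathway, by = "gene", all.x = TRUE)

# One row per gene of interest, in the original input order
result <- data.frame(
  gene = gene_interested$gene,
  pathways = vapply(
    gene_interested$gene,
    function(g) collapse_pathways(joined$pathway[joined$gene == g]),
    character(1), USE.NAMES = FALSE),
  stringsAsFactors = FALSE)
result
```

Genes without any pathway come back as a real `NA` rather than the text "NA", which makes them easy to filter out later with `is.na`.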

score 0 · Answer 3 · answered Sep 10 '15 at 03:27

This is how I understand what you are trying to do: (1) Loop through each gene in the first file, (2) Find all of the pathways associated with that gene in the second file, (3) Create a new file or R object, where each item contains the gene and all associated pathways.

I assume that the number of pathways per gene varies. As such, you probably want to store the results of your search function in a list object. Also, I don't think there are any matches in the data sample you provided. I replaced a few entries in the pathways file with 'Solyc07g064990' to illustrate.

# GENE LOOKUP

# load data
# genes is loaded as vector to reduce clutter below
genes <- read.csv('genes.csv', header = FALSE, stringsAsFactors = FALSE)[,1]
pathways <- read.csv('pathways.csv', header = FALSE, stringsAsFactors = FALSE)

# create empty list to store gene/pathway matches
compiled <- list()

# loop through genes
for(i in genes)
{
      # store matching indices from pathways table
      # (exact comparison; grep would also match partial IDs)
      matches <- which(pathways[,1] == i)
      # create new entry in 'compiled', giving it the current gene name (i)
      compiled[[i]] <- pathways[matches,2]
}
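
For step (3), one way the resulting list might be flattened into a two-column table is sketched below. The inline `genes` and `pathways` objects and the pathway names are made-up stand-ins for the CSV files, which aren't reproduced here; the loop uses exact matching on the gene ID.

```r
# Made-up stand-ins for genes.csv and pathways.csv (assumptions for illustration)
genes <- c("Solyc07g064990", "Solyc08g062250")
pathways <- data.frame(
  V1 = c("Solyc07g064990", "Solyc07g064990", "Solyc10g008120"),
  V2 = c("pathway A", "pathway B", "pathway C"),
  stringsAsFactors = FALSE)

# Build the named list: one entry per gene, holding its pathways
compiled <- list()
for (i in genes) {
  compiled[[i]] <- pathways[pathways[, 1] == i, 2]
}

# Flatten: one row per gene, pathways joined with "; "
# (a gene with no match ends up with an empty string)
flat <- data.frame(
  gene = names(compiled),
  pathways = unname(vapply(compiled, paste, character(1), collapse = "; ")),
  row.names = NULL,
  stringsAsFactors = FALSE)
flat
```

The `flat` data frame could then be saved with `write.csv(flat, "some_file.csv", row.names = FALSE)`, where the file name is a placeholder.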

Also stuck this on github if you want to grab the sample data. https://github.com/brlancer/stackex/tree/master/gene%20var%20lookup

First time contributing to Stack Overflow btw. Feedback welcome!

PavoDive · Answer 4 · 2015-09-10T04:33:59.643

Here's my dplyr attempt (I'm in the process of learning dplyr, so any feedback to simplify is highly appreciated):

# Create some dummy data
dat <- data.frame(gene = c(rep("a", 2), "b", rep("c", 2)),path = paste("path", 1:5))

# Load dplyr

library(dplyr)    

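dat %>% 
  group_by(gene) %>% 
  # paste the paths within each gene group (.$path would use every row)
  mutate(newpath = paste(path, collapse = ", ")) %>% 
  # keep one row per gene; .keep_all retains the newpath column
  distinct(gene, .keep_all = TRUE) %>% 
  select(gene, newpath) %>%
  # a$ID: the wanted gene IDs, taken from the asker's data frame a
  filter(gene %in% a$ID)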

group_by does exactly that: it groups by gene. Then a new column is added with mutate, holding a string made of the concatenation of all paths for each gene. There will be duplicated entries, so we need to keep distinct records. Last, we drop the initial path variable, as it's not needed any longer.

Result (for dummy data) looks like:

Source: local data table [3 x 2]

    gene        newpath
  (fctr)          (chr)
1      a path 1, path 2
2      a path 1, path 2
3      b         path 3

####### EDIT TO ADD #######

I missed your request to see only those genes contained in your a data frame. To do that, please check the revised code (last line: filter).

For a data.table solution:

library(data.table)
setDT(dat)  # convert the data.frame to a data.table by reference
dat[,newpath:=paste(path,collapse=", "),by=gene][!duplicated(gene)][gene %in% a$ID]

In the first of the chained commands, we are creating the new variable newpath with the concatenated paths, obviously grouped by gene. In the second "box" we are stating that we don't want duplicated records of gene. The last one filters only those genes in a$ID.
