
I am trying to separate data into columns using "\n" in RStudio and then separate that data further into rows using "\t". So far I have been able to separate the data by "\n", but I can't figure out how to further split the data by "\t". I can't find any header names in the data I am using, since it's a table that I downloaded from the MSigDB website. Here's what I have so far: `matrix_sep_by_enter <- read.table("msigdb.v5.2.symbols.txt", sep = "\n")`

How do I further separate this using "\t"?

Thank you!

user2554330
  • Could you please provide a reproducible example? It is tedious if we have to simulate data.frames ourself or download them externally before we can start answering your question. – JereB Jan 23 '18 at 23:45
  • While I generally agree with @JereB, regrettably, the MSigDB files are fairly awkward and large for sharing. Alternatively, you'd have to sign up for a (free) account if you want to download the database file from Broad yourself. In general, this should be a simple matter of (1) reading the file line by line (e.g. using `read.table(..., sep = "\n")`), and then (2) splitting every line based on `"\t"` using `strsplit(..., "\t")`. Please take a look at the example I'm giving below. – Maurits Evers Jan 24 '18 at 00:04
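The two steps from the comment above can be tried out without downloading MSigDB. This is a minimal sketch on a made-up two-line file: the set names, the `example.org` URLs, and the variable names are assumptions for illustration only, and `readLines()` is swapped in for `read.table(..., sep = "\n")` as another way to read a file line by line.

```r
# Toy stand-in for a GMT file: name, URL, then a variable number of gene symbols.
# The set names and URLs are invented for this example.
toy_lines <- c(
  "SET_A\thttp://example.org/SET_A\tMEF2C\tATP1B1\tLTBP1",
  "SET_B\thttp://example.org/SET_B\tLTBP1\tPLEKHM1"
)
gmt_path <- tempfile(fileext = ".gmt")
writeLines(toy_lines, gmt_path)

# Step 1: read the file line by line (one string per gene set).
lines <- readLines(gmt_path)

# Step 2: split every line on tabs; strsplit() returns a list of character vectors.
fields <- strsplit(lines, "\t", fixed = TRUE)

# Number of fields per line; the lines differ in length, hence a list.
lengths(fields)
```

Because the lines have different numbers of fields, the result is a list and not a rectangular table.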

1 Answer


I'm not entirely sure how you want to parse the MSigDB. I've downloaded the latest MSigDB GMT file, so I'll show you a possibility based on that file.

  1. Read GMT file.

    # Read each line of the GMT file as one string (one row per gene set).
    df <- read.table("msigdb.v6.1.symbols.gmt", sep = "\n", stringsAsFactors = FALSE)
    

    This creates a data.frame with one column and as many rows as there are lines in the GMT file.

  2. Split every line into substrings based on "\t".

    # Split each row on tabs; the result is a list with one character vector per gene set.
    lst <- apply(df, 1, function(x) unname(unlist(strsplit(x, "\t"))))
    

    The result is stored in a list of character vectors (of different lengths), where the first entry gives the gene set name, the second entry the MSigDB gene set weblink, and the remaining entries are the gene symbols associated with that gene set.

    str(lst, list.len = 5);
    #List of 17786
    # $ : chr [1:195] "AAANWWTGC_UNKNOWN" "http://www.broadinstitute.org/gsea/msigdb/cards/AAANWWTGC_UNKNOWN" "MEF2C" "ATP1B1" ...
    # $ : chr [1:376] "AAAYRNCTG_UNKNOWN" "http://www.broadinstitute.org/gsea/msigdb/cards/AAAYRNCTG_UNKNOWN" "LTBP1" "PLEKHM1" ...
    # $ : chr [1:267] "MYOD_01" "http://www.broadinstitute.org/gsea/msigdb/cards/MYOD_01" "KCNE1L" "FAM126A" ...
    # $ : chr [1:255] "E47_01" "http://www.broadinstitute.org/gsea/msigdb/cards/E47_01" "MLIP" "FAM126A" ...
    # $ : chr [1:251] "CMYB_01" "http://www.broadinstitute.org/gsea/msigdb/cards/CMYB_01" "FAM126A" "C5orf64" ...
    #  [list output truncated]
    
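Once the lines are split, it can be convenient to name each list element by its gene set and keep only the gene symbols. The following is a sketch under assumptions: `lst_toy` is a hypothetical stand-in with the same shape as `lst`, and its set names and URLs are made up.

```r
# Toy list in the same shape as `lst` (names and URLs are invented).
lst_toy <- list(
  c("SET_A", "http://example.org/SET_A", "MEF2C", "ATP1B1", "LTBP1"),
  c("SET_B", "http://example.org/SET_B", "LTBP1", "PLEKHM1")
)

# The first field of every vector is the gene set name.
set_names <- vapply(lst_toy, function(x) x[1], character(1))

# Drop the name and the weblink so that only gene symbols remain.
gene_sets <- lapply(lst_toy, function(x) x[-(1:2)])
names(gene_sets) <- set_names

# Look up one gene set by name, and count genes per set.
gene_sets[["SET_B"]]
lengths(gene_sets)
```

A named list like this works directly with `lapply`/`sapply`, for example to compute set sizes or overlaps.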
Maurits Evers
  • Thank you! I understand how you made the data into a list of vectors, and I'm trying to place those vectors into a matrix column by column, but I'm not able to. Every time I try adding a section of the list to a matrix, I get an error saying that the types don't match. Is there a way for me to add the character vectors to the matrix so that each character occupies a new cell? – Rohan Singhal Jan 24 '18 at 02:15
  • The short answer is that you can't place the vectors into a `matrix` (or `data.frame`) because they all have different lengths. If they *had* the same length, this would be a simple matter of using `cbind.data.frame` to combine the columns. The long(er) answer is that you *could* combine them provided you ensure the same length of all vectors by first padding them with e.g. `NA`s as needed. That would be pretty ugly though. On the other hand, the advantage of using a `list` is that you have the full `*apply` arsenal at your disposal for working with the gene sets. – Maurits Evers Jan 24 '18 at 02:22
  • My end goal is to use clustering methods to establish similarities between the different gene signatures based on the genes contained within those signature sets. That's why I'm trying to create a matrix, so I can use the clustering algorithms already established for matrices. If I want to create vectors of equal length, would I just create the vectors using an arbitrary length and use "fill=TRUE"? Also, let's say I create these vectors and there are a lot of NAs, would creating a sparse matrix out of that remove the NAs? Thank you for your help! – Rohan Singhal Jan 24 '18 at 02:40
  • @MauritsEvers, instead of `apply(...)`, just use `strsplit(as.character(df$V1), "\t")`. Also, technically, you can have `list` columns in a `data.frame` (even in a `matrix`) but using the data may not always be convenient. – A5C1D2H2I1M1N2O1R2T1 Jan 24 '18 at 03:30
  • @RohanSinghal, using "data.table", you should be able to do something like `library(data.table); out <- fread("msigdb.v6.1.symbols.gmt", sep = "\n", header = FALSE)[, tstrsplit(V1, "\t")]`. However, the resulting table would be 17786 rows by almost 2942 columns.... Use `table(colSums(is.na(out)))` and `table(rowSums(is.na(out)))` to get a sense of the missing data. – A5C1D2H2I1M1N2O1R2T1 Jan 24 '18 at 03:34
  • @RohanSinghal I've got no idea what you mean by *"use clustering methods to establish similarities between the different gene signatures ..."*. I don't understand how you want to cluster a matrix of gene symbols. What similarity measure/distance function do you want to use? Either way there will be overlap in genes between different gene sets, that's just how gene sets work. [...] – Maurits Evers Jan 24 '18 at 03:34
  • [...] If you want to identify gene signatures, a gene set enrichment analysis (GSEA) or gene ontology over-representation analysis would be the way forward. I don't see how column-binding `NA`-padded vectors gives you something meaningful for any downstream analysis. – Maurits Evers Jan 24 '18 at 03:34
  • @A5C1D2H2I1M1N2O1R2T1 Sure, there are many different ways of reading/parsing the MSigDB data. The point is that gene sets contain varying numbers of genes; IMO the most convenient way to handle the data is in a `list` of character vectors. – Maurits Evers Jan 24 '18 at 03:40
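For readers following the discussion above, here is a rough sketch of the two matrix layouts mentioned in the comments, on invented toy data. Option 1 is the `NA` padding described as "pretty ugly". Option 2, a binary gene-set-by-gene membership matrix with a Jaccard-type distance, is not proposed anywhere in the thread; it is an assumption about one encoding that might suit clustering by shared genes, not a recommendation from the answerer.

```r
# Toy gene sets of different lengths (set names are invented).
gene_sets <- list(
  SET_A = c("MEF2C", "ATP1B1", "LTBP1"),
  SET_B = c("LTBP1", "PLEKHM1"),
  SET_C = c("MEF2C", "LTBP1")
)

# Option 1: pad every vector with NA up to the longest length.
# Indexing past the end of a vector yields NA, so this gives equal lengths.
max_len <- max(lengths(gene_sets))
padded <- sapply(gene_sets, function(x) x[seq_len(max_len)])
dim(padded)  # max_len rows, one column per gene set

# Option 2 (an assumption, not from the thread): binary membership matrix
# with one row per gene set and one column per gene.
all_genes <- sort(unique(unlist(gene_sets)))
membership <- t(sapply(gene_sets, function(x) as.integer(all_genes %in% x)))
colnames(membership) <- all_genes

# dist(method = "binary") is the proportion of genes found in only one of
# two sets among the genes found in at least one of them (Jaccard distance).
d <- dist(membership, method = "binary")
hc <- hclust(d, method = "average")
```

On the real MSigDB file, the membership matrix would have 17786 rows and one column per distinct gene symbol, so a sparse representation may be worth considering.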