I have a data frame in which every column is a character vector.

    Gene.ID    barcodes                        value
    A2M        TCGA-BA-5149-01A-01D-1512-08    Missense_Mutation
    ABCC10     TCGA-BA-5559-01A-01D-1512-08    Missense_Mutation
    ABCC11     TCGA-BA-5557-01A-01D-1512-08    Silent
    ABCC8      TCGA-BA-5555-01A-01D-1512-08    Missense_Mutation
    ABHD5      TCGA-BA-5149-01A-01D-1512-08    Missense_Mutation
    ACCN1      TCGA-BA-5149-01A-01D-1512-08    Missense_Mutation

How can I use reshape/reshape2 to cast this into a wide data frame of the form Gene.ID ~ barcodes, where each cell holds the text from the value column and gene/barcode combinations with no entry are filled with "NA" or "WT"?

The aggregation function keeps defaulting to length, which I want to avoid if possible.

  • I don't totally follow what you are trying to do, perhaps because I don't use reshape/reshape2 very often. Are you trying to get the data in a form where you have gene.id, barcodes, missense_mutation, silent, ... as variables? – iacobus Mar 26 '14 at 06:25
  • I am trying to get a dataframe with barcodes in columns and gene.IDs in rows, with "value" being the value of each cell. If value is missing for a particular gene/barcode combination I want it to be "WT" or "NA". – Ankur Chakravarthy Mar 26 '14 at 08:52
  • Do you have duplicated values in your "Gene.ID" or "barcodes" columns? – A5C1D2H2I1M1N2O1R2T1 Mar 28 '14 at 15:56
  • Yes, Ananda. Some genes are mutated in more than one sample. However, iacobus came up with a solution that means this is not a problem. – Ankur Chakravarthy Mar 30 '14 at 01:43

1 Answer

I think this will work for your problem. First, I generate some data similar to yours. I make gene and barcode factors for simplicity; the approach should behave the same way with your character columns.

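    # Display names for the ten genes (stand-ins for names such as A2M)
    geneNames <- paste0("gene", 1:10)
    # 19 rows: barcode 1 has all ten genes, barcode 2 is missing gene 5
    data <- data.frame(gene = as.factor(c(1:10, 1:4, 6:10)),
                       express = sample(c("Silent", "Missense_Mutation"), 19, TRUE),
                       barcode = as.factor(c(rep(1, 10), rep(2, 9))))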

geneNames is a vector of the gene names (e.g., A2M in your data). To get NA for the gene/barcode combinations that have no entry, merge the data against a complete grid so that you end up with number_of_genes × number_of_barcodes rows.

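    geneID <- unique(data$gene)
    # One row per barcode/gene pair; geneID is recycled across the barcodes
    data2 <- data.frame(barcode = rep(unique(data$barcode), each = length(geneID)),
                        gene = geneID)
    # all.y = TRUE keeps pairs that are absent from data; their express is NA
    data3 <- merge(data, data2, by = c("barcode", "gene"), all.y = TRUE)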
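
The complete gene-by-barcode grid described above can also be sketched with base R's expand.grid, which builds every combination explicitly rather than relying on vector recycling. The toy values and the names calls, grid, and full are hypothetical and only there for illustration.

```r
# Toy long-format data (hypothetical values, for illustration only)
calls <- data.frame(
  gene    = c("g1", "g2", "g1"),
  barcode = c("s1", "s1", "s2"),
  express = c("Silent", "Missense_Mutation", "Silent"),
  stringsAsFactors = FALSE
)

# Every gene/barcode combination, whether or not it was observed
grid <- expand.grid(
  gene    = unique(calls$gene),
  barcode = unique(calls$barcode),
  stringsAsFactors = FALSE
)

# all.y = TRUE keeps the combinations that are missing from calls;
# their express value is NA after the merge
full <- merge(calls, grid, by = c("gene", "barcode"), all.y = TRUE)
```

Here full has one row for each of the 2 × 2 combinations, and the unobserved pair (g2, s2) carries NA in express.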

Now melt and cast the data:

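    library(reshape)
    # Long format: one row per barcode/gene pair, with express as the value
    mdata3 <- melt(data3, id.vars = c("barcode", "gene"))
    # One row per barcode, one column per gene; identity is passed as the
    # aggregation function so that cast does not fall back to length
    cdata <- cast(mdata3, barcode ~ variable + gene, identity)
    # Replace the generated column names with the gene names
    names(cdata) <- c("barcode", geneNames)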

You should then have a data frame with number_of_barcodes rows and (number_of_unique_genes + 1) columns. Each gene column should contain the expression information for that gene in each sample barcode.
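
For comparison, the orientation asked for in the question (genes in rows, barcodes in columns, "WT" as the filler) can be sketched in base R with tapply. This is a swapped-in alternative, not the reshape approach above. The rows are copied from the question; the names mutations, wide, and wide_df are hypothetical, and using ";" to join duplicated gene/barcode pairs is an assumption.

```r
# Sample rows copied from the question; all columns are character
mutations <- data.frame(
  Gene.ID  = c("A2M", "ABCC10", "ABCC11", "ABCC8", "ABHD5", "ACCN1"),
  barcodes = c("TCGA-BA-5149-01A-01D-1512-08", "TCGA-BA-5559-01A-01D-1512-08",
               "TCGA-BA-5557-01A-01D-1512-08", "TCGA-BA-5555-01A-01D-1512-08",
               "TCGA-BA-5149-01A-01D-1512-08", "TCGA-BA-5149-01A-01D-1512-08"),
  value    = c("Missense_Mutation", "Missense_Mutation", "Silent",
               "Missense_Mutation", "Missense_Mutation", "Missense_Mutation"),
  stringsAsFactors = FALSE
)

# Cross-tabulate genes (rows) by barcodes (columns). paste() collapses any
# duplicated gene/barcode pair into one string instead of counting rows.
wide <- tapply(mutations$value,
               list(mutations$Gene.ID, mutations$barcodes),
               paste, collapse = ";")

# Combinations with no entry come back as NA; use "WT" as the filler
wide[is.na(wide)] <- "WT"

# Optional: convert the character matrix to a data frame
wide_df <- as.data.frame(wide, stringsAsFactors = FALSE)
```

Because the aggregation function returns text rather than a count, each cell keeps the mutation label, and a gene mutated in several samples simply fills several columns of its row.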

iacobus