I have a bunch of files which I'm merging into one data frame. The file names look like this: unc.edu.b6530750-0410-43ec-bb79-f862ca3424a6.1918120.rsem.genes.results

And I want the file names to be the column names. I'm using the following code:

for (file in file_list){

  if (!exists("dataset")){
    dataset <- read.table(file, header=TRUE,
                          colClasses = c(rep("character", 2), rep("NULL", 2)),
                          col.names = c("gene_id", deparse(substitute(file)), "NuLL", "NULL"),
                          sep="\t")
    print(deparse(substitute(file)))
  }

  if (exists("dataset")){
    temp_dataset <- read.table(file, header=TRUE,
                               colClasses = c(rep("character", 2), rep("NULL", 2)),
                               col.names = c("gene_id", deparse(substitute(file)), "NuLL", "NULL"),
                               sep="\t")
    print(deparse(substitute(file)))
    dataset <- merge(dataset, temp_dataset, by = "gene_id")
    rm(temp_dataset)
  }
}

All goes well except that the column names now have the hyphens replaced by dots.

colnames(dataset)

[1] "gene_id"                                                                       
[2] "X...unc.edu.02cb8dbe.ef56.471c.b52d.41c29219fd95.1794854.rsem.genes.results..x"
[3] "X...unc.edu.02cb8dbe.ef56.471c.b52d.41c29219fd95.1794854.rsem.genes.results..y"
[4] "X...unc.edu.02f5dcba.bdcc.4424.aed4.195a8d551325.2085643.rsem.genes.results."  

Any explanation as to what causes this would be helpful because I will need to change these names, using another file, later on.

Carmen Sandoval
paul_dg

1 Answer

As @akrun stated in the comments, read.table(file, ..., check.names=FALSE) will solve the immediate problem.
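
To see why this works: with the default `check.names = TRUE`, `read.table` passes the column names through `make.names()`, which turns any character that is not allowed in a syntactic R name (such as the hyphens in these file names) into a dot. Here is a minimal sketch, using a shortened, made-up name and a throwaway temporary file purely for illustration:

```r
# Hypothetical, shortened column name modelled on the file names in the question
nm <- "unc.edu.b6530750-0410.rsem.genes.results"

# make.names() is what read.table applies when check.names = TRUE
mangled <- make.names(nm)

# A throwaway two-column TSV, only to show the effect inside read.table
tmp <- tempfile(fileext = ".tsv")
writeLines(c("gene_id\tcount", "TNF\t1"), tmp)

with_check <- read.table(tmp, header = TRUE, sep = "\t",
                         col.names = c("gene_id", nm))
without_check <- read.table(tmp, header = TRUE, sep = "\t",
                            col.names = c("gene_id", nm),
                            check.names = FALSE)

names(with_check)[2]     # hyphen replaced by a dot
names(without_check)[2]  # name kept exactly as supplied
unlink(tmp)
```

The leading `X...` and trailing dot in the question's output are consistent with `deparse()` returning the file name wrapped in literal quotes, since quotes and slashes are also turned into dots. Passing `file` itself as the column name would likely avoid that.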

However, there are now neater ways to achieve what you're trying to do using some of the tidyverse packages.

First let's load packages and generate some sample data:

library(purrr)
library(readr)
data <- c("gene_id\tresult\trandom_a\trandom_b
TNF\t1e-8\t1.7\t4.3
IL8\t0.4\t-0.3\t8.6",
"gene_id\tresult\trandom_a\trandom_b
TNF\t2.4e-7\t1.7\t4.3
IL8\t0.9\t0.8\t8.3",
"gene_id\tresult\trandom_a\trandom_b
TNSF8\t0.003\t2.1\t9.7
IL8\t0.02\t1.9\t4.6")
file_list <- sprintf("file_%d.csv", 1:3)
walk2(data, file_list, ~write_tsv(read_tsv(.x), .y))

Now here's the actual bit that reads and merges the data:

library(purrr)
library(readr)
library(dplyr)
dataset <- file_list %>%
  map(~read_tsv(.x, col_types = "cc__", col_names = c("gene_id", .x), skip = 1)) %>%
  reduce(full_join, by = "gene_id")

This uses map to read each file in turn, skipping the first (presumably header) row and the third and fourth columns, and naming the two remaining columns gene_id and the file's own name. The resulting data frames are then joined sequentially using dplyr::full_join and purrr::reduce.
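
For anyone who would rather avoid the extra dependencies, the same read-then-reduce pattern can be sketched in base R with `lapply`, `Reduce` and `merge`. This is an illustrative alternative, not part of the approach above: the helper `read_one`, the temporary directory and the `sample-%d.tsv` file names are all made up for the example. It also sidesteps merging the first file with itself, which the two back-to-back `if` blocks in the question appear to do (note the `.x`/`.y` pair in its output).

```r
# Throwaway input files mirroring the sample data (hypothetical names)
dir <- tempfile("rsem_")
dir.create(dir)
header <- "gene_id\tresult\trandom_a\trandom_b"
contents <- list(
  c(header, "TNF\t1e-8\t1.7\t4.3",    "IL8\t0.4\t-0.3\t8.6"),
  c(header, "TNF\t2.4e-7\t1.7\t4.3",  "IL8\t0.9\t0.8\t8.3"),
  c(header, "TNSF8\t0.003\t2.1\t9.7", "IL8\t0.02\t1.9\t4.6")
)
paths <- file.path(dir, sprintf("sample-%d.tsv", 1:3))
for (i in seq_along(paths)) writeLines(contents[[i]], paths[i])

# Hypothetical helper: keep the first two columns as character, drop the
# last two, and name the value column after the file (hyphens preserved)
read_one <- function(path) {
  read.table(path, header = TRUE, sep = "\t",
             colClasses = c("character", "character", "NULL", "NULL"),
             col.names = c("gene_id", basename(path), "drop_a", "drop_b"),
             check.names = FALSE)
}

# all = TRUE keeps genes missing from some files, like full_join does
merged <- Reduce(function(x, y) merge(x, y, by = "gene_id", all = TRUE),
                 lapply(paths, read_one))
names(merged)
unlink(dir, recursive = TRUE)
```

Without `all = TRUE`, `merge` performs an inner join, so a gene absent from any one file would be dropped from the result.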

Although this question was asked a long time ago, this type of task is common, so I thought a tidyverse-based answer would still be useful. (And it's still in the 'unanswered questions with votes' filter.)

Nick Kennedy
  • I don't see why this would be a 'neater' way than `read.table(file, ..., check.names=FALSE)`. It is certainly more complex, with more chances to make an error and more dependencies. – Sebastien Renaut Mar 22 '21 at 20:11
  • @SebastienRenaut I think you may have missed the point of the latter part of my answer; it wasn’t an alternative to using `check.names`, but an alternative to the overall loop in the OP’s code. It uses three lines to achieve what took 12 in the original code. – Nick Kennedy Mar 23 '21 at 01:09