4

I have several txt files with 3 columns in each files like this: file 1:

ProbeID X_Signal_intensity X_P-Value   
xxx         2.34          .89
xxx         6.45          .04 
xxx         1.09          .91  
xxx         5.87          .70
.            .            . 
.            .            .
.            .            .     

file 2:

ProbeID Y_Signal_intensity Y_P-Value   
xxx         1.4             .92
xxx         2.55            .14 
xxx         4.19            .16  
xxx         3.47            .80
.            .               . 
.            .               .
.            .               . 

file 3:

ProbeID Z_Signal_intensity Z_P-Value   
xxx         9.40             .82
xxx         1.55            .04 
xxx         3.19            .56  
xxx         2.47            .90
.            .               . 
.            .               .
.            .               . 

In all the above files the values of ProbeID column are identical but not the other columns.Now I want to combine the all the above files using a for-loop into a single file like this:

ProbeID X_intensity X_P-Value   Y_intensity Y_P-Value   Z_intensity Z_P-Value     
xxx      2.34          .89       1.4             .92     9.40            .82
xxx      6.45          .04       2.55            .14     1.55            .04
xxx      1.09          .91       4.19            .16     3.19            .56
xxx      5.87          .70       3.47            .80     2.47            .90

Please do help me.

eipi10
  • 91,525
  • 24
  • 209
  • 285
Dinesh
  • 643
  • 5
  • 16
  • 31
  • See related questions : http://stackoverflow.com/questions/3764292/loading-many-files-at-once , http://stackoverflow.com/questions/4234179/combine-the-multiple-files-with-one-header , ... – Joris Meys Aug 04 '11 at 14:02

4 Answers4

4

Read in the files as given by Richie Cotton, but make sure you add the appropriate extra arguments in the apply call. For one, header=TRUE should probably be added.

file.names <- c("file X.txt", "file Y.txt", "file Z.txt")
file.list <- lapply(file.names, read.table, header=TRUE)

Then you'll probably need a merge_recurse from the reshape package :

require(reshape)
mynewframe <- merge_recurse(file.list,all.x=TRUE,all.y=TRUE,by="ProbeID")

This will work for any given number of dataframes, provided it's not a billion of them. For more information on the arguments used, see the help page of ?merge.

CORRECTION : in merge_recurse, you have to use all.x and all.y as shown in the correction above. You can't just use the shortcut all or you'll get errors.

Small demonstration :

X2 <- data.frame(ProbeID=(2:4),Z2=4:6)
X1 <- data.frame(ProbeID=1:3,Z1=1:3)
X3 <- data.frame(ProbeID=1:3,Z3=7:9)
file.list <- list(X1,X2,X3)
mynewframe <- merge_recurse(file.list,all.x=TRUE,all.y=TRUE,by="ProbeID")
> mynewframe
  ProbeID Z1 Z2 Z3
1       1  1 NA  7
2       2  2  4  8
3       3  3  5  9
4       4 NA  6 NA
Joris Meys
  • 106,551
  • 31
  • 221
  • 263
  • 1
    Note that there is a bug in merge_recurse that prevents it from using the additional arguments you are passing. In fact, merge_recurse(file.list) will produce exactly the same output. The bug affects any merge of three or more data frames. – Chris Warth Apr 18 '14 at 17:48
2

Read in your files

filenames <- c("file X.txt", "file Y.txt", "file Z.txt")
data_list <- lapply(filenames, read.table)

Combine them into one big data frame

all_data <- do.call(cbind, data_list)

all_data <- do.call(merge, data_list, by = "ProbeID")

This gives a good lesson to "always concentrate when providing an answer". cbind isn't smart enough to do ID matching, and merge isn't smart enough to handle more than two data frames. Take a look at Joris's answer and use merge_recurse instead. Or forget what you thought you wanted and use my other answer below.


Actually, a better idea, rather than having many columns would be to have just 4 columns: ProbeID, Signal_intensity, P_value and Source_file.

data_list <- lapply(data_list, function(x) {
  colnames(x) <- c("ProbeID", "Signal_intensity", "P_value")
  x
})

all_data <- do.call(rbind, data_list)
all_data$Source_file <- rep(filenames, times = sapply(data_list, nrow))
Richie Cotton
  • 118,240
  • 47
  • 247
  • 360
  • I deleted the comments for cleanup (SO starts complaining lately about extended comments). Feel free to do so too. – Joris Meys Aug 04 '11 at 17:13
0

My approach is to read the files into data.frames

see help(read.delim) for reading modes.

After you have your three data.frames you can use

total <- merge(dataframeA,dataframeB,by="ProbeID")

look here http://www.statmethods.net/management/merging.html for documentation.

Sarah West
  • 2,047
  • 2
  • 20
  • 37
0

I am going to throw another approach into the mix which uses Reduce

Reduce(function(...) merge(..., all = T), file.list)
Ramnath
  • 54,439
  • 16
  • 125
  • 152