My question involves how to skip metadata at the beginning of a file when importing data into R. My data is in .txt format, where the first lines are metadata describing the data; these need to be filtered out. Below is a minimal example of the data in tab-delimited format:

Type=GenePix Export                         
DateTime=2010/03/04 16:04:16                        
PixelSize=10                        
Wavelengths=635                     
ImageFiles=Not Saved                        
NormalizationMethod=None                        
NormalizationFactors=1                      
JpegImage=                      
StdDev=Type 1                       
FeatureType=Circular                        
Barcode=                        
BackgroundSubtraction=LocalFeature                      
ImageOrigin=150, 10                     
JpegOrigin=150, 2760                        
Creator=GenePix Pro 7.2.29.002                      
var1    var2    var3    var4    var5    var6    var7
1   1   1   molecule1   1F3 400 4020
1   2   1   molecule2   1B5 221 4020
1   3   1   molecule3   1H5 122 2110
1   4   1   molecule4   1D1 402 2110
1   5   1   molecule5   1F1 600 4020

I could use the basic command shown below if I knew the line where the actual data starts:

mydata <- read.table("mydata.txt",header=T, skip=15)

which would return:

mydata
  var1 var2 var3      var4 var5 var6 var7
1    1    1    1 molecule1  1F3  400 4020
2    1    2    1 molecule2  1B5  221 4020
3    1    3    1 molecule3  1H5  122 2110
4    1    4    1 molecule4  1D1  402 2110
5    1    5    1 molecule5  1F1  600 4020

The problem is that I need to write a script that can read various datasets, where the row at which the actual data starts varies from one dataset to another. I could imagine using something like the sqldf package, but I am not very familiar with SQL.

Any assistance would be greatly appreciated.

amo
  • Have you determined what marks the splitting point between metadata and actual data (this would need to work for all datasets; for example, maybe the data uses tabs for field separators, and no tabs are ever present in the metadata)? – Dominic Comtois Mar 16 '15 at 07:38
  • @amo Do you have any specific patterns where the column names begin? – akrun Mar 16 '15 at 07:38
  • Yes, one splitting point between metadata and actual data is that data uses tabs for field separators, and no tabs are ever present in metadata and this works for all datasets @DominicComtois – amo Mar 16 '15 at 08:19
  • Yes, there is a specific pattern where the column names begin i.e. var1 and this works for all datasets @akrun – amo Mar 16 '15 at 08:20
  • @amo Thanks, I updated the solution based on that info – akrun Mar 16 '15 at 08:44

5 Answers


How about using the existing functions for reading DNA microarray data? These are available in packages developed by the Bioconductor project.

For example, roughly something like this:

library(limma)

mydata<-read.maimages("mydata.txt", source="genepix")

See the limma manual for more examples. It can readily import most DNA microarray formats.

  • When I tried this code using the input data, `read.maimages('mydata.txt', source='genepix') #Error in readGPRHeader(fullname) : File is not in Axon Text File (ATF) format` – akrun Mar 16 '15 at 11:46

You could use count.fields() to determine the skip argument. I'll call your file "x.txt".

read.table("x.txt", skip = which.max(count.fields("x.txt") == 7) - 1, 
    header = TRUE)
#   var1 var2 var3      var4 var5 var6 var7
# 1    1    1    1 molecule1  1F3  400 4020
# 2    1    2    1 molecule2  1B5  221 4020
# 3    1    3    1 molecule3  1H5  122 2110
# 4    1    4    1 molecule4  1D1  402 2110
# 5    1    5    1 molecule5  1F1  600 4020

So this starts reading the file at the first occurrence of a line with seven fields.
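Per the comments below, this can be generalized: instead of hard-coding seven fields, count tab-separated fields and skip everything before the first line that has more than one. A sketch (the sample file written here is hypothetical, just to make the snippet self-contained):

```r
# Write a minimal sample: metadata lines without tabs, then a tab-delimited table.
writeLines(c("Type=GenePix Export",
             "Creator=GenePix Pro 7.2.29.002",
             "var1\tvar2\tvar3",
             "1\t2\t3",
             "4\t5\t6"), "x.txt")

# Metadata lines contain no tabs, so each counts as a single field.
n.fields <- count.fields("x.txt", sep = "\t")

# Skip everything before the first line with more than one tab-separated field.
mydata <- read.table("x.txt", skip = which.max(n.fields > 1) - 1,
                     header = TRUE, sep = "\t")
mydata
#   var1 var2 var3
# 1    1    2    3
# 2    4    5    6
```

This avoids assuming a particular number of columns; it only assumes the metadata never contains a tab.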

Rich Scriven
  • Why not `skip = which.max(count.fields("x.txt")) - 1`, so that having a file with 6, 8 or 12 columns would also work? – Dominic Comtois Mar 16 '15 at 12:12
  • Yes, I would consider just having `skip = which.max(count.fields("x.txt")) - 1` an improvement to the solution as it works generally in different cases. @DominicComtois – amo Mar 16 '15 at 15:01
  • @amo And btw you can add `sep="\t"` as an argument to `count.fields` to be even more thorough. – Dominic Comtois Mar 16 '15 at 15:03

Suppose all the files have Creator as the last metadata line,

read.table(pipe("awk 'NR ==1, /Creator/ {next}{print}' mydata.txt"),
              header=TRUE)
#  var1 var2 var3      var4 var5 var6 var7
#1    1    1    1 molecule1  1F3  400 4020
#2    1    2    1 molecule2  1B5  221 4020
#3    1    3    1 molecule3  1H5  122 2110
#4    1    4    1 molecule4  1D1  402 2110
#5    1    5    1 molecule5  1F1  600 4020

If you know the number of columns, you could also do

read.table(pipe("awk 'NF==7{print}' mydata.txt"), header=TRUE)
#  var1 var2 var3      var4 var5 var6 var7
#1    1    1    1 molecule1  1F3  400 4020
#2    1    2    1 molecule2  1B5  221 4020
#3    1    3    1 molecule3  1H5  122 2110
#4    1    4    1 molecule4  1D1  402 2110
#5    1    5    1 molecule5  1F1  600 4020

Update

If we need to start reading from the first occurrence of 'var1' to the end of the file,

read.table(pipe("awk '/var1/ { matched = 1 } matched { print }' mydata.txt"),
           header=TRUE)
#  var1 var2 var3      var4 var5 var6 var7
#1    1    1    1 molecule1  1F3  400 4020
#2    1    2    1 molecule2  1B5  221 4020
#3    1    3    1 molecule3  1H5  122 2110
#4    1    4    1 molecule4  1D1  402 2110
#5    1    5    1 molecule5  1F1  600 4020
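The awk range logic can be checked directly in a shell first (a sketch; the sample file created here is hypothetical):

```shell
# Build a tiny sample: metadata lines, then a tab-delimited table.
printf 'Type=GenePix Export\nCreator=GenePix Pro 7.2.29.002\n' > mydata.txt
printf 'var1\tvar2\n1\t10\n2\t20\n' >> mydata.txt

# Print from the first line matching "var1" through the end of the file.
awk '/var1/ { matched = 1 } matched { print }' mydata.txt
```

Only the header line and the two data rows should be printed; the metadata lines are dropped.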

The above solutions work fine on a Linux system; on Windows, they failed (as per the comments). An option that could work on both systems is

lines <- readLines('mydata.txt')
read.table(text=lines[grep('var1', lines):length(lines)], header=TRUE)
#  var1 var2 var3      var4 var5 var6 var7
#1    1    1    1 molecule1  1F3  400 4020
#2    1    2    1 molecule2  1B5  221 4020
#3    1    3    1 molecule3  1H5  122 2110
#4    1    4    1 molecule4  1D1  402 2110
#5    1    5    1 molecule5  1F1  600 4020
akrun
  • The character in the last metadata line varies from one data set to another so solution one might not work in all cases. About the two other possible solutions I get the errors: `Error in read.table(pipe("awk 'NF==7{print}' mydata.txt"), header = TRUE) : no lines available in input` and `Error in read.table(pipe("awk '/var1/ { matched = 1}matched { print }' mydata.txt"), : no lines available in input`. What might be wrong? @akrun – amo Mar 16 '15 at 09:49
  • @amo I was just copy/pasting the input data shown in the post and saved it as `mydata.txt`, so I don't know what the error might be in your case; it works on my system. – akrun Mar 16 '15 at 10:41
  • @DominicComtois I am using `linux mint`, but I think it could work on Windows: http://stackoverflow.com/questions/21927944/how-to-run-an-awk-commands-in-windows – akrun Mar 16 '15 at 10:44
  • @amo If I understand your comments, the last two solutions get the error. Do you have the problem with the first code? (just for curiosity) – akrun Mar 16 '15 at 10:48
  • There is actually an awk version for Windows but it's not there by default, gotta install it from http://gnuwin32.sourceforge.net/packages/gawk.htm – Dominic Comtois Mar 16 '15 at 11:02
  • @DominicComtois Thanks for the link. I don't use windows that much, so didn't know that this will not work – akrun Mar 16 '15 at 11:04
  • Yes, I have a problem with the first solution too @akrun – amo Mar 16 '15 at 11:28
  • I am on windows 8. It seems from the gawk installation site that there is no gawk version for Windows 8 @DominicComtois – amo Mar 16 '15 at 11:29
  • Ok... well that explains why this solution doesn't work. But it seems you have plenty of other very good options! – Dominic Comtois Mar 16 '15 at 11:30
  • @amo Anyway, this will be an option if you have a chance to use linux system. – akrun Mar 16 '15 at 11:30

A solution based on the presence of tabs in the actual data (and not in the metadata). As a "bonus", you have the option to display (via cat) whatever lines were considered metadata.

Main reading function

read.genepix <- function(filename, disp.meta = FALSE) {

    infile <- file(description = filename, open = "r" )

    # create a meta indicator function
    is.meta <- function(text) !grepl(pattern = "\\t", x = text)

    # Prepare to store meta text (if needed)
    meta.text <- c()
    meta <- TRUE

    while(isTRUE(meta)) {

        last.pos <- seek(infile, where = NA)
        current.line <- readLines(infile, n = 1)
        meta <- is.meta(current.line)

        if(isTRUE(meta)) {
            meta.text <- append(meta.text, current.line)
        } else {
            seek(infile, where = last.pos)
            data.txt <- paste0(readLines(infile),collapse="\n")
            close(infile)
            break
        }
    }

    if(isTRUE(disp.meta)) {
        cat(paste(meta.text, collapse="\n"))
    }

    return(read.table(text=data.txt, header = TRUE, sep = "\t", quote=""))
}

Usage / Results

my.data <- read.genepix("somefile.txt")

my.data

#   var1 var2 var3      var4 var5 var6 var7
# 1    1    1    1 molecule1  1F3  400 4020
# 2    1    2    1 molecule2  1B5  221 4020
# 3    1    3    1 molecule3  1H5  122 2110
# 4    1    4    1 molecule4  1D1  402 2110
# 5    1    5    1 molecule5  1F1  600 4020

Sample data used in this answer (saved to disk as "somefile.txt"). Note that SO replaces tabs with runs of spaces in the data section, so in your text editor you'll need to replace those spaces with tabs in order for the code to work.

capture.output(cat("Type=GenePix Export
DateTime=2010/03/04 16:04:16
PixelSize=10
Wavelengths=635
ImageFiles=Not Saved
NormalizationMethod=None
NormalizationFactors=1
JpegImage=
StdDev=Type 1
FeatureType=Circular
Barcode=
BackgroundSubtraction=LocalFeature
ImageOrigin=150, 10
JpegOrigin=150, 2760
Creator=GenePix Pro 7.2.29.002
var1    var2    var3    var4    var5    var6    var7
1   1   1   molecule1   1F3 400 4020
1   2   1   molecule2   1B5 221 4020
1   3   1   molecule3   1H5 122 2110
1   4   1   molecule4   1D1 402 2110
1   5   1   molecule5   1F1 600 4020
"), file="somefile.txt")
Dominic Comtois
  • The function `read.genepix` seems to print the metadata irrespective of whether `print.meta = TRUE` or `print.meta = FALSE`. I guess there is a small bug in the function. @DominicComtois – amo Mar 16 '15 at 09:57
  • I updated the solution so that the file gets read only once; the printing of meta should also work fine now. – Dominic Comtois Mar 16 '15 at 09:59
  • The problem with the printing of the metadata even when `print.meta = FALSE)` still appears @DominicComtois – amo Mar 16 '15 at 12:26
  • I can't really say why... it didn't on my machine. I renamed the argument in case it had anything to do with it, and added a stronger condition; it should work fine now. If not, make sure you restart your session before re-testing it pls! – Dominic Comtois Mar 16 '15 at 12:51

The comments to your question describe the following:

  1. The metadata section does not include any tabs
  2. The data section is tab-delimited

As such, I'm guessing that your data are something like the sample data found at the end of this answer.

If this is the case, you can use the magic of fread to automatically determine where your data starts.

Here's a demo:

cat(A, file = "mytest.txt", sep = "\n")
library(data.table)
fread("mytest.txt")
#    var1 var2 var3      var4 var5 var6 var7
# 1:    1    1    1 molecule1  1F3  400 4020
# 2:    1    2    1 molecule2  1B5  221 4020
# 3:    1    3    1 molecule3  1H5  122 2110
# 4:    1    4    1 molecule4  1D1  402 2110
# 5:    1    5    1 molecule5  1F1  600 4020

Sample data:

A <- c("Type=GenePix Export", "DateTime=2010/03/04 16:04:16", "PixelSize=10", 
"Wavelengths=635", "ImageFiles=Not Saved", "NormalizationMethod=None", 
"NormalizationFactors=1", "JpegImage=", "StdDev=Type 1", "FeatureType=Circular", 
"Barcode=", "BackgroundSubtraction=LocalFeature", "ImageOrigin=150, 10", 
"JpegOrigin=150, 2760", "Creator=GenePix Pro 7.2.29.002", 
"var1\tvar2\tvar3\tvar4\tvar5\tvar6\tvar7", 
"1\t1\t1\tmolecule1\t1F3\t400\t4020", "1\t2\t1\tmolecule2\t1B5\t221\t4020", 
"1\t3\t1\tmolecule3\t1H5\t122\t2110", "1\t4\t1\tmolecule4\t1D1\t402\t2110", 
"1\t5\t1\tmolecule5\t1F1\t600\t4020")

A
#  [1] "Type=GenePix Export"               
#  [2] "DateTime=2010/03/04 16:04:16"      
#  [3] "PixelSize=10"                      
#  [4] "Wavelengths=635"                   
#  [5] "ImageFiles=Not Saved"              
#  [6] "NormalizationMethod=None"          
#  [7] "NormalizationFactors=1"            
#  [8] "JpegImage="                        
#  [9] "StdDev=Type 1"                     
# [10] "FeatureType=Circular"              
# [11] "Barcode="                          
# [12] "BackgroundSubtraction=LocalFeature"
# [13] "ImageOrigin=150, 10"               
# [14] "JpegOrigin=150, 2760"              
# [15] "Creator=GenePix Pro 7.2.29.002"    
# [16] "var1\tvar2\tvar3\tvar4\tvar5\tvar6\tvar7"
# [17] "1\t1\t1\tmolecule1\t1F3\t400\t4020"      
# [18] "1\t2\t1\tmolecule2\t1B5\t221\t4020"      
# [19] "1\t3\t1\tmolecule3\t1H5\t122\t2110"      
# [20] "1\t4\t1\tmolecule4\t1D1\t402\t2110"      
# [21] "1\t5\t1\tmolecule5\t1F1\t600\t4020"    
A5C1D2H2I1M1N2O1R2T1
  • The solution involving `fread("mytest.txt")` works with the data set created by converting the original data set to a different format using `cat(A, file = "mytest.txt", sep = "\n")`, but it does not filter out the metadata in the original data set. @AnandaMahto – amo Mar 16 '15 at 12:46
  • @amo, you will always get best input if you provide a reproducible example. An easy way to do this would be to use something like `dput(readLines("mytest.txt", n = 20))` (where "mytest.txt" should be replaced by your actual file name) so that we can see what exactly the first 20 lines of your file look like. With that, we can all use `cat` to recreate a sample file and test our solutions. – A5C1D2H2I1M1N2O1R2T1 Mar 16 '15 at 13:04
  • The process of recreating a sample file works fine with the minimal data set, but it might be quite a task for the real data set, which contains >3000 rows. @AnandaMahto – amo Mar 16 '15 at 15:06
  • @amo, My point was simply that since you didn't provide reproducible sample data, I created sample data that describes what you mentioned in your comments: a file with metadata that doesn't contain tabs followed by tabular data with each column separated by tabs. I pointed you to `fread` because `fread` handles such problems with ease, even datasets with millions of rows. Sorry it didn't work out. I'm guessing that your description of the problem was off. My *guess* is that there *are* tabs in your metadata. You can easily inspect that by setting your text editor to show whitespace characters. – A5C1D2H2I1M1N2O1R2T1 Mar 16 '15 at 15:34
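Following up on that last comment, a quick sketch for checking from within R which lines actually contain tab characters (the file written here is just an example):

```r
# Write a small example file: one metadata line, then two tab-delimited lines.
writeLines(c("ImageOrigin=150, 10", "var1\tvar2", "1\t2"), "mytest.txt")

# TRUE for every line that contains at least one tab character.
lines <- readLines("mytest.txt")
print(grepl("\t", lines, fixed = TRUE))
# [1] FALSE  TRUE  TRUE
```

If the metadata lines come back TRUE as well, the "no tabs in metadata" assumption behind several of these answers does not hold for your file.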