My data looks like the sample below, but with millions of lines. The text can be copied into a text file and read in for my example.

@HISEQ:104:C7Y3WACXX:4:1101:1307:1946 1:N:0:CGATGT
NTCCGGTAGTGTAGCAGATCGGAAGAGCACACGTCTGAACTCCAGTCACC
+                                                 
#0<FFFBBFBFFFFFIFIFIIIIIIIFIIIIIIIIIIIIIIIIFIIFIII
@HISEQ:104:C7Y3WACXX:4:1101:1356:1968 1:N:0:CGATGT
CGAGAGCTTTGAAGGCCGAAGTGGAAGATCGGAAGAGCACACGTCTGAAC
+                                                 
BBBFFFFFFFFFFFFFFFIIIBFFIIIIIFIIIIIIIIIIIIIFFFFFFF

I am trying to read in the text above and determine the lengths of the strings that start with N, A, C, G or T (the sequence lines). I would usually do something like this:

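# read the first 8 lines, one string per line
f <- scan(filepath, nmax = 8, what = "character", sep = "\n")
# keep the lines that start with a nucleotide code
f1 <- f[grep("^[NAGCT]+", f)]
# length of each kept line
nchar(f1)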
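
One caveat with the pattern above: quality lines can also begin with A, C, G, T or N, so the regular expression may pick up quality strings as well. Below is a minimal sketch that selects the sequence lines by position instead. It assumes strict four-line FASTQ records with no wrapped sequences; the sample records and the name `read_lengths` are made up for illustration.

```r
# Hypothetical sample: two FASTQ records written to a temporary file.
filepath <- tempfile(fileext = ".fastq")
writeLines(c(
  "@read1", "NTCCGGTAGT",      "+", "#0<FFFBBFB",
  "@read2", "CGAGAGCTTTGAAGG", "+", "CCCFFFFFFFFFFFF"
), filepath)

f <- readLines(filepath, n = 8)

# Assumption: every record is exactly four lines, so the sequence
# is line 2 of each record (lines 2, 6, 10, ... of the file).
seq_lines <- f[seq(2, length(f), by = 4)]
read_lengths <- nchar(seq_lines)

# For comparison: the regex also matches the second quality line,
# because that line happens to start with "C".
regex_hits <- grep("^[NAGCT]+", f)
```

Here the regex matches three lines even though there are only two reads, which is why the positional version may be safer on real data.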

How would I go about doing the same with the ff package?

library(ff)
f <- read.table.ffdf(file=filepath,header=F,nrow=8,sep="\n")

I have tried various approaches but none of them work.
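
For reference, here is the kind of result I am after, sketched without `ff`: it swaps in a plain base R connection and reads the file in chunks, so only one chunk is in memory at a time. This is only a sketch under assumptions: records are strictly four lines each, and `read_length_table` and `chunk_records` are names I made up for illustration.

```r
# Sketch: stream a FASTQ file in chunks and count reads per length.
# `read_length_table` and `chunk_records` are hypothetical names.
read_length_table <- function(filepath, chunk_records = 100000L) {
  con <- file(filepath, open = "r")
  on.exit(close(con))
  counts <- integer(0)  # counts[k] = number of reads of length k
  repeat {
    # Read a whole number of four-line records so chunks stay aligned.
    chunk <- readLines(con, n = 4L * chunk_records)
    if (length(chunk) < 2L) break  # end of file (or a truncated record)
    # Assumption: the sequence is line 2 of every four-line record.
    lens <- nchar(chunk[seq(2L, length(chunk), by = 4L)])
    tab <- tabulate(lens)  # note: zero-length reads are not counted
    # Pad both vectors to a common length before adding them.
    n <- max(length(counts), length(tab))
    counts <- c(counts, integer(n - length(counts))) +
      c(tab, integer(n - length(tab)))
  }
  counts
}

# Hypothetical sample file with three records of lengths 10, 15 and 10.
filepath <- tempfile(fileext = ".fastq")
writeLines(c(
  "@read1", "NTCCGGTAGT",      "+", "#0<FFFBBFB",
  "@read2", "CGAGAGCTTTGAAGG", "+", "CCCFFFFFFFFFFFF",
  "@read3", "ACGTACGTAC",      "+", "IIIIIIIIII"
), filepath)

# A chunk size of one record forces several passes through the loop.
res <- read_length_table(filepath, chunk_records = 1L)
```

`which(res > 0)` then gives the read lengths that occur, and `res[which(res > 0)]` how often each occurs. Keeping only a table of counts, rather than one length per read, keeps the result small even for millions of reads.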

mindlessgreen
  • Do you really need to use R? Simple bash commands may be a better fit for this. If you need to plot anything, import the results later. – Ananta Feb 01 '16 at 19:19
  • @Ananta I agree it's easy to do in bash, but I happen to be on Windows and do not have access to Linux at this point. Besides, making things work in `ff` can come in handy... – mindlessgreen Feb 01 '16 at 19:30
  • @MichaelChirico The data is too large to fit into RAM, which is why I would like to use `ff`. Does `fread()` not just read the whole thing in? – mindlessgreen Feb 01 '16 at 19:31
  • In the example the lines are all the same length so it would be sufficient to read the first few lines, e.g. `max(nchar(readLines("myfile.dat", 10)))` – G. Grothendieck Feb 01 '16 at 23:03
  • Any reason you don't just use `library(ShortRead)`? Also, unless this is trimmed data that you got from a database, Illumina reads are all the same length. – JeremyS Feb 02 '16 at 04:10
  • The strings are of variable length further down in the file. This was just an example. I am not very familiar with `ShortRead`. Does it work with larger-than-RAM datasets? – mindlessgreen Feb 02 '16 at 09:10
