R: Big data: Determine string length

Question

My data looks like the sample below, but with millions of lines. The sample can be copied into a text file and read in for my example further down.

@HISEQ:104:C7Y3WACXX:4:1101:1307:1946 1:N:0:CGATGT
NTCCGGTAGTGTAGCAGATCGGAAGAGCACACGTCTGAACTCCAGTCACC
+                                                 
#0<FFFBBFBFFFFFIFIFIIIIIIIFIIIIIIIIIIIIIIIIFIIFIII
@HISEQ:104:C7Y3WACXX:4:1101:1356:1968 1:N:0:CGATGT
CGAGAGCTTTGAAGGCCGAAGTGGAAGATCGGAAGAGCACACGTCTGAAC
+                                                 
BBBFFFFFFFFFFFFFFFIIIBFFIIIIIFIIIIIIIIIIIIIFFFFFFF

I am trying to read in the text above and determine the length of the strings that start with N, A, C, G or T (the sequence lines). I would usually do something like this:

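# Read the first 8 lines, one string per line
f <- scan(filepath, nmax = 8, what = "character", sep = "\n")
# Keep lines made up only of N, A, G, C or T; the trailing $ stops quality lines that merely begin with one of these letters from matching
f1 <- f[grep("^[NAGCT]+$", f)]
nchar(f1)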

How would I go about doing the same with ff package?

library(ff)
f <- read.table.ffdf(file = filepath, header = FALSE, nrows = 8, sep = "\n")

I have tried various approaches but none of them work.
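
For reference, here is a minimal sketch of one way the same lengths could be computed without holding the file in memory. It does not use `ff`. It swaps in plain base R instead, reading from an open connection in chunks with `readLines()`. It assumes a strict four-lines-per-record FASTQ layout (sequence always on line 2 of each record, no wrapped sequences), which may not hold for every file. The function name `fastq_length_table` and its `records_per_chunk` argument are made up for illustration.

```r
# Hypothetical helper (not part of ff or any package): tabulate read lengths
# in a FASTQ file chunk by chunk, so only one chunk is in memory at a time.
# Assumes every record is exactly 4 lines: header, sequence, "+", quality.
fastq_length_table <- function(path, records_per_chunk = 100000L) {
  con <- file(path, open = "r")
  on.exit(close(con))

  # Pad an integer vector with zeros up to length n
  pad <- function(x, n) c(x, integer(n - length(x)))

  counts <- integer(0)  # counts[k] = number of reads of length k
  repeat {
    # An open connection remembers its position, so each call
    # continues where the previous one stopped.
    lines <- readLines(con, n = 4L * records_per_chunk)
    if (length(lines) == 0L) break

    # The sequence is the 2nd line of every 4-line record
    seqs <- lines[seq_along(lines) %% 4L == 2L]

    # tabulate() counts how often each positive length occurs
    # (zero-length reads are ignored)
    chunk_counts <- tabulate(nchar(seqs))

    n <- max(length(counts), length(chunk_counts))
    counts <- pad(counts, n) + pad(chunk_counts, n)
  }

  seen <- which(counts > 0L)
  data.frame(read_length = seen, count = counts[seen])
}
```

Calling `fastq_length_table(filepath)` would return one row per distinct read length. Picking the sequence line by position rather than by regex sidesteps quality lines that happen to start with A, C, G, T or N, at the cost of relying on the four-line layout.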

Do you really need to use R? Simple bash commands may be a better fit for this, and if you need to plot anything you can import the results later. — Ananta, Feb 01 '16 at 19:19
@Ananta I agree it's easy to do in bash, but it just so happens that I am on Windows and do not have access to Linux at this point. Besides, making things work in ff can come in handy... — mindlessgreen, Feb 01 '16 at 19:30
@MichaelChirico The data is too large to fit into RAM, which is why I would like to use `ff`. Does `fread()` not just read the whole thing in? — mindlessgreen, Feb 01 '16 at 19:31
In the example the lines are all the same length so it would be sufficient to read the first few lines, e.g. `max(nchar(readLines("myfile.dat", 10)))` — G. Grothendieck, Feb 01 '16 at 23:03
Any reason you don't just use `library(ShortRead)`? Also, unless this is trimmed data that you got from a database, Illumina reads are all the same length. — JeremyS, Feb 02 '16 at 04:10
The strings have variable lengths further down in the file; this was just an example. I am not very familiar with `ShortRead`. Does it work with larger-than-RAM datasets? — mindlessgreen, Feb 02 '16 at 09:10
