
I want to use the readr package to read big FASTA files and count the entries. Each file has multiple lines and each entry starts with a >. I am not really interested in the other data; I just want to count the lines starting with >.

I thought the most efficient way would be to use read_lines_chunked from the readr package, but the result is a bit strange.

# example data: five entries, each header line starts with '>'
s <- '>a\nb\nc\n>d\ne\n>f\ng\n>h\ni\nj\n>k\nl'
# keep only the header lines of each chunk
f <- function(x, pos) x[grepl('^>', x)]
jnk <- readr::read_lines_chunked(s, readr::DataFrameCallback$new(f), chunk_size=5)

The result is not a single vector of the matching lines as I expected, but a matrix, and it even contains strange results: for example, it lists >k twice:

     [,1] [,2]
[1,] ">a" ">d"
[2,] ">f" ">h"
[3,] ">k" ">k"

Can somebody help me with this or suggest a better way of counting the lines starting with > in big files without loading everything into memory?

drmariod
  • What's your OS? On Mac/linux you can just do that in the terminal `grep ">" file | wc -l` – konvas Jan 20 '17 at 11:44
  • It is a Mac, but it will be used on Windows and Unix... that's why I wanted to provide an R solution instead of a command-line one – drmariod Jan 20 '17 at 11:47
  • If you just want the number of lines, you can modify your function to `f <- function(x, pos) length(grep('^>', x))`, then the total number in the file (across all chunks) is `sum(jnk)`. But I'd strongly suggest going the grep + wc way; I think there are GNU tools on Windows but I don't have experience using them – konvas Jan 20 '17 at 12:10
  • 1
    By the way, the reason that ">k" appears twice on the last line is because if a chunk has fewer matches than other chunks, the result gets recycled - if you change to `chunk_size = 1`, you'll get the expected result – konvas Jan 20 '17 at 12:22
  • I realised I also need the text and not just the counts, but by wrapping the `read_lines_chunked` call in `unique(as.vector())` it worked as I expected and is pretty quick (see the sketch below)... – drmariod Jan 20 '17 at 12:43
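
A minimal sketch building on the suggestions in the comments, assuming readr >= 1.0 (read_lines_chunked, DataFrameCallback and write_lines are all readr functions). The key change is to return a data frame from the callback instead of a bare character vector: DataFrameCallback row-binds data frames, so nothing gets recycled into a matrix. The temporary file, chunk size and column names are only illustrative; for a real FASTA file you would pass its path directly.

library(readr)

# Write the example data to a temporary file; in practice `path` would be
# the path to the big FASTA file.
path <- tempfile(fileext = ".fasta")
write_lines(c(">a", "b", "c", ">d", "e", ">f", "g", ">h", "i", "j", ">k", "l"), path)

# One row per chunk holding the number of header lines; DataFrameCallback
# row-binds these, so there is no recycling.
count_chunk <- function(x, pos) data.frame(n = sum(grepl("^>", x)))
per_chunk <- read_lines_chunked(path, DataFrameCallback$new(count_chunk), chunk_size = 5)
sum(per_chunk$n)  # 5 entries

# To keep the header lines themselves, return them as a data frame column
# instead of a bare vector:
header_chunk <- function(x, pos) data.frame(header = x[grepl("^>", x)], stringsAsFactors = FALSE)
headers <- read_lines_chunked(path, DataFrameCallback$new(header_chunk), chunk_size = 5)
headers$header  # ">a" ">d" ">f" ">h" ">k"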

0 Answers