I want to use the readr package to read in big FASTA files and count the entries. The file has multiple rows and each entry starts with a >. In general I am not interested in the other data, I just want to count the lines starting with >.
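For context, a FASTA file looks roughly like this (made-up entries):
>seq1 some description
ACGTACGTACGT
ACGT
>seq2
TTGCAAC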
I thought the most efficient approach would be to use read_lines_chunked from the readr package, but the result is a bit strange.
s <- '>a\nb\nc\n>d\ne\n>f\ng\n>h\ni\nj\n>k\nl'  # toy stand-in for a big FASTA file
f <- function(x, pos) x[grepl('^>', x)]  # keep only the header lines of each chunk
jnk <- readr::read_lines_chunked(s, readr::DataFrameCallback$new(f), chunk_size = 5)
The result is not a single character vector of the matching lines but a matrix, and it even contains strange values: for example, >k is listed twice:
     [,1] [,2]
[1,] ">a" ">d"
[2,] ">f" ">h"
[3,] ">k" ">k"
Can somebody help me with this, or suggest a better way of counting the lines starting with > in big files without loading everything into memory?
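For what it's worth, the following sketch using readr's AccumulateCallback seems to give the right count (5) on the toy example above, but I am not sure whether it is the idiomatic or most efficient approach for really big files:
# keep a running count of header lines across chunks
count_headers <- function(x, pos, acc) acc + sum(grepl('^>', x))
n <- readr::read_lines_chunked(s, readr::AccumulateCallback$new(count_headers, acc = 0), chunk_size = 5)
n  # should be 5 for the toy example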