Multiple skips when reading a csv file

Question

I have a very large file from which I need only the first element of rows 1, 100001, and 200001. I extract them like this:

x1 <- read.csv(filename, nrows = 1, header = F)[1, 1]
x2 <- read.csv(filename, skip = 100000, nrows = 1, header = F)[1, 1]
x3 <- read.csv(filename, skip = 200000, nrows = 1, header = F)[1, 1]

I don't know how reading works, but I assume this forces some unnecessary reading/skipping.

I wonder if I could continue skipping after reading x2 instead of starting at the beginning of the file again. That would save some time.

I do not want to hold the whole file (or even the whole first column) in memory at any point if I can avoid it.

Might be relevant [Read lines by number from a large file](https://stackoverflow.com/questions/7156770/read-lines-by-number-from-a-large-file/7156792) — benson23, Jun 08 '23 at 14:57
I was _just_ about to answer that, @benson23! :-) Working code for a CSV: `con <- file("~/tmp/mt.csv"); open(con); read.csv(con, header=TRUE, nrows=5); read.csv(con, header=FALSE, nrows=5); read.csv(con, header=FALSE, nrows=5);` where the first call to `read.csv` has the correct headers (which you may want to store), and subsequent calls to `read.csv` must not look for a header row. — r2evans, Jun 08 '23 at 15:03
As an alternative ... while `arrow::read_csv_arrow(..., as_data_frame=FALSE)` does not support slicing by row number, it does allow lazy filtering and pulling only rows/columns you need each time. See https://arrow.apache.org/docs/r/articles/data_wrangling.html. — r2evans, Jun 08 '23 at 15:05

Rui Barradas · Answer 1 · 2023-06-09T00:39:32.903

Here is a way with scan. It assumes you are reading numeric data; if not, include

what = character()

in the calls to scan. The test file is at the end.
Note that I'm skipping 10 lines, not 100000 as in the question.

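fl <- "~/Temp/so.csv"

sep <- ","
skip <- 10L   # read every 10th line

vec <- NULL   # first elements collected so far
skp <- 0L     # number of lines to skip before the next read
# first field of the first line
x <- scan(fl, sep = sep, n = 1L, nlines = 1L)
# scan() returns a zero-length vector once skp reaches past the end of the file
while(length(x) > 0L) {
  vec <- c(vec, x)
  skp <- skp + skip
  # first field of the line that follows skp skipped lines
  x <- scan(fl, sep = sep, n = 1L, skip = skp, nlines = 1L)
}
vec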
#> [1]  1 11 21 31

Created on 2023-06-08 with reprex v2.0.2
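
The loop above passes the file name to scan, so every call opens the file again and skips from the top. As a rough sketch of the "continue where the last read stopped" idea from the question, the file can be opened once as a connection and read with readLines, which picks up at the current position. This is not part of the reprex above: the function name `first_field_every` and its arguments are made up for illustration, and it assumes the first field is unquoted and contains no separator.

```r
# Hypothetical helper (not from the answer above): return the first field of
# lines 1, step + 1, 2 * step + 1, ... while reading the file only once.
# Assumption: the first field is unquoted and does not contain `sep`.
first_field_every <- function(path, step, sep = ",") {
  con <- file(path, open = "r")      # open once; reads continue from the current position
  on.exit(close(con), add = TRUE)
  out <- character()
  repeat {
    line <- readLines(con, n = 1L)   # the line we want
    if (length(line) == 0L) break    # end of file reached
    out <- c(out, strsplit(line, sep, fixed = TRUE)[[1L]][1L])
    readLines(con, n = step - 1L)    # read and discard the lines in between
  }
  out
}

# Usage on a file shaped like the 40-row test file in the Data section
tmp <- tempfile(fileext = ".csv")
writeLines(sprintf("%d,%s,%d", 1:40, rep_len(c("a", "b", "c"), 40L), 21:60), tmp)
as.integer(first_field_every(tmp, step = 10L))
#> [1]  1 11 21 31
```

The discarded lines are still read, but at most step - 1 of them are held in memory at a time, and the values come back as character, so convert them as needed.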

Data

These are the contents of the test file (40 rows).

1,a,21
2,b,22
3,c,23
4,a,24
5,b,25
6,c,26
7,a,27
8,b,28
9,c,29
10,a,30
11,b,31
12,c,32
13,a,33
14,b,34
15,c,35
16,a,36
17,b,37
18,c,38
19,a,39
20,b,40
21,c,41
22,a,42
23,b,43
24,c,44
25,a,45
26,b,46
27,c,47
28,a,48
29,b,49
30,c,50
31,a,51
32,b,52
33,c,53
34,a,54
35,b,55
36,c,56
37,a,57
38,b,58
39,c,59
40,a,60
