I have a very large file from which I need only the first element of rows 1, 100001, and 200001. I extract them like this:

x1 <- read.csv(filename, nrows = 1, header = FALSE)[1, 1]
x2 <- read.csv(filename, skip = 100000, nrows = 1, header = FALSE)[1, 1]
x3 <- read.csv(filename, skip = 200000, nrows = 1, header = FALSE)[1, 1]

I don't know how reading works internally, but I assume this forces some unnecessary reading and skipping, because each call starts again at the beginning of the file.

I wonder whether I could continue skipping after reading x2 instead of starting at the beginning of the file again. That would save some time.

I do not want to have the whole file (or the whole first column) in memory (at some point) if I can avoid it.

Řídící
  • Might be relevant [Read lines by number from a large file](https://stackoverflow.com/questions/7156770/read-lines-by-number-from-a-large-file/7156792) – benson23 Jun 08 '23 at 14:57
  • I was _just_ about to answer that, @benson23! :-) Working code for a CSV: `con <- file("~/tmp/mt.csv"); open(con); read.csv(con, header=TRUE, nrows=5); read.csv(con, header=FALSE, nrows=5); read.csv(con, header=FALSE, nrows=5);` where the first call to `read.csv` has the correct headers (which you may want to store), and subsequent calls to `read.csv` must not look for a header row. – r2evans Jun 08 '23 at 15:03
  • @Řídící, confirm this works for you please! – r2evans Jun 08 '23 at 15:04
  • As an alternative ... while `arrow::read_csv_arrow(..., as_data_frame=FALSE)` does not support slicing by row number, it does allow lazy filtering and pulling only rows/columns you need each time. See https://arrow.apache.org/docs/r/articles/data_wrangling.html. – r2evans Jun 08 '23 at 15:05

1 Answer

Here is a way with `scan`. It assumes you are reading numeric data; if not, include

what = character()

in the calls to `scan`. The test file is at the end.
Note that I'm skipping 10 lines, not 100000.

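fl <- "~/Temp/so.csv"

sep <- ","    # field separator
skip <- 10L   # distance between the rows to keep

vec <- NULL   # collects the first field of each selected row
skp <- 0L     # number of lines to skip on the next read
# n = 1L reads a single field, nlines = 1L stays within one line
x <- scan(fl, sep = sep, n = 1L, nlines = 1L)
while(length(x) > 0L) {
  vec <- c(vec, x)
  skp <- skp + skip
  # scan() returns a zero-length vector once skp reaches the end of the file
  x <- scan(fl, sep = sep, n = 1L, skip = skp, nlines = 1L)
}
vec
#> [1]  1 11 21 31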

Created on 2023-06-08 with reprex v2.0.2
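
Because the loop above passes a file name to `scan`, every call opens the file again and skips from the top. The question asks whether skipping can continue from the previous position, and r2evans's comment points at open connections for that. Below is a minimal sketch of that idea, not a tested drop-in replacement: `read_every_nth_first` is a hypothetical helper name, and it assumes a plain separator with no quoted fields. It still holds up to `step - 1` skipped lines in memory at a time, so for very large steps you may want to discard them in smaller chunks.

```r
# Hypothetical helper (not from the original answer): return the first field
# of lines 1, 1 + step, 1 + 2 * step, ... using a single open connection,
# so earlier lines are never read twice.
read_every_nth_first <- function(path, step, sep = ",") {
  con <- file(path, open = "r")
  on.exit(close(con))
  out <- character()
  repeat {
    line <- readLines(con, n = 1L)               # the line we want
    if (length(line) == 0L) break                # end of file
    out <- c(out, strsplit(line, sep, fixed = TRUE)[[1L]][1L])
    # Discard the next step - 1 lines; the connection keeps its position.
    skipped <- readLines(con, n = step - 1L)
    if (length(skipped) < step - 1L) break       # file ended while skipping
  }
  out
}

# Usage on a temporary copy of the 40-row test file shown under "Data".
tmp <- tempfile(fileext = ".csv")
writeLines(paste(1:40, rep(c("a", "b", "c"), length.out = 40), 21:60, sep = ","), tmp)
res <- as.numeric(read_every_nth_first(tmp, step = 10L))
print(res)
#> [1]  1 11 21 31
```

The result comes back as character, so convert it (as in the usage lines) if the first column is numeric.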


Data

This is the contents of the test file (40 rows).

1,a,21
2,b,22
3,c,23
4,a,24
5,b,25
6,c,26
7,a,27
8,b,28
9,c,29
10,a,30
11,b,31
12,c,32
13,a,33
14,b,34
15,c,35
16,a,36
17,b,37
18,c,38
19,a,39
20,b,40
21,c,41
22,a,42
23,b,43
24,c,44
25,a,45
26,b,46
27,c,47
28,a,48
29,b,49
30,c,50
31,a,51
32,b,52
33,c,53
34,a,54
35,b,55
36,c,56
37,a,57
38,b,58
39,c,59
40,a,60
Rui Barradas