
Just a quick question: is there a way to use read.csv to import every Nth row from a large file?

For example, a 50-60 million line file where you only need every 4th row, starting at row 2.

I thought about maybe incorporating the 'seq' function, but I am not sure if that is possible.

Any suggestions?

tomathon

5 Answers


For a large data file, the best option is to filter out the unnecessary rows before they get imported into R. The simplest way to do this is by means of OS commands like sed, awk, grep, etc. For example, the following code reads every 4th line from the file:

# create a sample file to work with
write.csv(1:1000, file = 'test.csv')

# awk prints every 4th line; the lines contain no spaces, so $1 is the whole line
file.pipe <- pipe("awk 'BEGIN{i=0}{i++;if (i%4==0) print $1}' < test.csv")
res <- read.csv(file.pipe)
res

> res
     X3 X3.1
1     7    7
2    11   11
3    15   15
4    19   19
5    23   23
6    27   27
7    31   31
8    35   35
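
If you specifically need every 4th row starting at row 2, as in the question, a shorter awk filter along these lines should also work (NR is awk's built-in record counter; the filter is my own variation, not part of the original answer):

file.pipe <- pipe("awk 'NR % 4 == 2' test.csv")
res <- read.csv(file.pipe, header = FALSE)  # header = FALSE: the selected lines carry no header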
df239
read.csv("filename.csv")[c(FALSE, TRUE, FALSE, FALSE), ]

will do the trick.

This works since the logical vector is recycled until it matches the number of rows of the data frame returned by read.csv.
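
If you would rather have the offset and step as parameters, an equivalent (still whole-file-in-memory) variant using seq() might look like this; the file name is just a placeholder:

df <- read.csv("filename.csv")
df[seq(2, nrow(df), by = 4), ]   # every 4th row, starting at row 2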

Sven Hohenstein
    Note that this will read the entire file into memory and convert it to a data frame, then throw away the three quarters of the data that you do not want. This is a great approach for moderate sized files, but if you are doing this because the file is too large to fit into memory, then this method will not work. – Greg Snow Feb 19 '14 at 21:14
  • This will not do the trick if the reason for dropping the other rows is that they have a different number of columns. – Emile Zäkiev Jan 18 '23 at 13:50

As @df239 suggested, it's much better to filter the rows beforehand using a command-line tool.

Here's a simpler version using sed:

df <- read.csv(pipe("sed -n '2~4p' test.csv")) 

The 2~4p tells sed to get every 4th line, starting at line 2.
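
If the file has a header row you want to keep, something along these lines should also work; note that the ~ step address is a GNU sed extension, so it may not be available in BSD sed on macOS:

df <- read.csv(pipe("sed -n '1p;2~4p' test.csv"))   # line 1 (the header) plus every 4th line from line 2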

Scott Ritchie

Sven gave a great answer for moderately sized files. But if the reason you are doing this is that the entire file does not fit into memory, then you need to take a different approach.

It may be simplest to use an external tool like Perl or AWK to preprocess the file so that it contains only the lines that you want; you can use pipe() to read from the output of another program so that you do not have to create an intermediate file.

Another approach would be to transfer the file to a database, then select just the rows that you want from the database.
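
For the database route, a rough sketch using the sqldf package might look like the following; the file name and the every-4th filter are placeholders of mine, and read.csv.sql refers to the imported table as `file` in its sql argument:

library(sqldf)

# read.csv.sql() imports the csv into a temporary SQLite table and returns
# only the rows the query selects, so the full file never comes into R
res <- read.csv.sql("big.csv",
                    sql = "select * from file where rowid % 4 = 2")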

You can also loop through the file. If you explicitly open the file, then you can read a few rows at a time, keep just the ones that you want, then read the next chunk starting where you left off. The skip and nrows arguments to read.csv are helpful here.
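
A rough sketch of that loop is below; the file name, the helper name read_every_nth, and the headerless-file assumption are mine. Reading from an explicitly opened connection lets each read.csv() call pick up where the previous one stopped:

# illustrative sketch only: headerless csv assumed, names are made up
read_every_nth <- function(path, nth = 4, start = 2, chunk_size = 100000) {
  stopifnot(chunk_size %% nth == 0)   # keeps the every-nth pattern aligned across chunks
  con <- file(path, open = "r")
  on.exit(close(con))
  pieces <- list()
  repeat {
    chunk <- tryCatch(
      read.csv(con, header = FALSE, nrows = chunk_size),
      error = function(e) NULL        # read.csv errors once the connection is exhausted
    )
    if (is.null(chunk) || nrow(chunk) == 0) break
    if (nrow(chunk) >= start) {
      keep <- seq(start, nrow(chunk), by = nth)
      pieces[[length(pieces) + 1]] <- chunk[keep, , drop = FALSE]
    }
  }
  do.call(rbind, pieces)
}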

Greg Snow

While the sed and awk solutions are great, it might be nice to do this within R itself (say on Windows machines or to avoid GNU sed vs BSD sed differences). Using readr::read_*_chunked from the tidyverse with a callback that samples every nth row works rather well:

library(readr)
library(dplyr)

read_tsv_sample <- function(fn, nth, ...) {
  # keep every nth row of each chunk
  sample_df_cb <- function(df, idx) {
    df[seq(1, nrow(df), nth), ]
  }

  read_tsv_chunked(fn,
                   ...,
                   chunk_size = 10000,
                   callback = DataFrameCallback$new(sample_df_cb)
  ) %>%
    bind_rows()
}

For example...

iris %>% write_tsv("iris.tsv")

iris %>% dim
#> [1] 150   5

"iris.tsv" %>%
    read_tsv_sample(10,
                    col_types = cols(.default = col_double())
                    ) %>%
    dim
#> [1] 15  5
saladi
    Thank you for the Windows-compatible solution! Didn't realize for quite some time that pipes somehow don't work straightforwardly on this OS. – ste Jan 08 '19 at 18:33