
Just a quick question: is there a way to use read.csv to import every Nth row from a large file?

For example, a 50-60 million line file where you only need every 4th row, starting at row 2.

I thought about maybe incorporating the 'seq' function, but I am not sure if that is possible.

Any suggestions?

tomathon

5 Answers


For a large data file, the best option is to filter out the unnecessary rows before they get imported into R. The simplest way to do this is by means of OS commands like sed, awk, grep, etc. For example, the following code reads every 4th line from the file:

# create a sample file to work with
write.csv(1:1000, file = 'test.csv')

# awk prints every 4th line; the lines contain no spaces, so $1 is the whole line
file.pipe <- pipe("awk 'BEGIN{i=0}{i++;if (i%4==0) print $1}' < test.csv")
res <- read.csv(file.pipe)
res

> res
     X3 X3.1
1     7    7
2    11   11
3    15   15
4    19   19
5    23   23
6    27   27
7    31   31
8    35   35
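
If you specifically need every 4th row starting at row 2, as in the question, a shorter awk filter along these lines should also work (NR is awk's built-in record counter; the filter is my own variation, not part of the original answer):

file.pipe <- pipe("awk 'NR % 4 == 2' test.csv")
res <- read.csv(file.pipe, header = FALSE)  # header = FALSE: the selected lines carry no header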
df239
read.csv("filename.csv")[c(FALSE, TRUE, FALSE, FALSE), ]

will do the trick.

This works since the logical vector is recycled until it matches the number of rows of the data frame returned by read.csv.
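
If you would rather have the offset and step as parameters, an equivalent (still whole-file-in-memory) variant using seq() might look like this; the file name is just a placeholder:

df <- read.csv("filename.csv")
df[seq(2, nrow(df), by = 4), ]   # every 4th row, starting at row 2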

Sven Hohenstein
    Note that this will read the entire file into memory and convert it to a data frame, then throw away the three quarters of the data that you do not want. This is a great approach for moderate sized files, but if you are doing this because the file is too large to fit into memory, then this method will not work. – Greg Snow Feb 19 '14 at 21:14
  • This will not do the trick if the reason for dropping the other rows is that they have a different number of columns. – Emile Zäkiev Jan 18 '23 at 13:50

As @df239 suggested, it's much better to filter the rows beforehand using a command-line tool.

Here's a simpler version using sed:

df <- read.csv(pipe("sed -n '2~4p' test.csv")) 

The 2~4p tells sed to get every 4th line, starting at line 2.
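
If the file has a header row you want to keep, something along these lines should also work; note that the ~ step address is a GNU sed extension, so it may not be available in BSD sed on macOS:

df <- read.csv(pipe("sed -n '1p;2~4p' test.csv"))   # line 1 (the header) plus every 4th line from line 2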

Scott Ritchie

Sven gave a great answer for moderately sized files. But if the reason you are doing this is that the entire file does not fit into memory, then you need to take a different approach.

It may be simplest to use an external tool like Perl or AWK to preprocess the file so that it contains only the lines that you want; you can use pipe() to read from the output of another program so that you do not have to create an intermediate file.

Another approach would be to transfer the file to a database, then select just the rows that you want from the database.
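
For the database route, a rough sketch using the sqldf package might look like the following; the file name and the every-4th filter are placeholders of mine, and read.csv.sql refers to the imported table as `file` in its sql argument:

library(sqldf)

# read.csv.sql() imports the csv into a temporary SQLite table and returns
# only the rows the query selects, so the full file never comes into R
res <- read.csv.sql("big.csv",
                    sql = "select * from file where rowid % 4 = 2")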

You can also loop through the file. If you explicitly open the file, then you can read a few rows at a time, keep just the ones that you want, then read the next chunk starting where you left off. The skip and nrows arguments to read.csv are helpful here.
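
A rough sketch of that loop is below; the file name, the helper name read_every_nth, and the headerless-file assumption are mine. Reading from an explicitly opened connection lets each read.csv() call pick up where the previous one stopped:

# illustrative sketch only: headerless csv assumed, names are made up
read_every_nth <- function(path, nth = 4, start = 2, chunk_size = 100000) {
  stopifnot(chunk_size %% nth == 0)   # keeps the every-nth pattern aligned across chunks
  con <- file(path, open = "r")
  on.exit(close(con))
  pieces <- list()
  repeat {
    chunk <- tryCatch(
      read.csv(con, header = FALSE, nrows = chunk_size),
      error = function(e) NULL        # read.csv errors once the connection is exhausted
    )
    if (is.null(chunk) || nrow(chunk) == 0) break
    if (nrow(chunk) >= start) {
      keep <- seq(start, nrow(chunk), by = nth)
      pieces[[length(pieces) + 1]] <- chunk[keep, , drop = FALSE]
    }
  }
  do.call(rbind, pieces)
}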

Greg Snow

While the sed and awk solutions are great, it might be nice to do this within R itself (say on Windows machines or to avoid GNU sed vs BSD sed differences). Using readr::read_*_chunked from the tidyverse with a callback that samples every nth row works rather well:

library(readr)
library(dplyr)

read_tsv_sample <- function(fn, nth, ...) {
  # keep every nth row of each chunk
  sample_df_cb <- function(df, idx) {
    df[seq(1, nrow(df), nth), ]
  }

  read_tsv_chunked(fn,
                   ...,
                   chunk_size = 10000,
                   callback = DataFrameCallback$new(sample_df_cb)
  ) %>%
    bind_rows()
}

For example...

iris %>% write_tsv("iris.tsv")

iris %>% dim
#> [1] 150   5

"iris.tsv" %>%
    read_tsv_sample(10,
                    col_types = cols(.default = col_double())
                    ) %>%
    dim
#> [1] 15  5
saladi
    Thank you for the Windows-compatible solution! Didn't realize for quite some time that pipes somehow don't work straightforwardly on this OS. – ste Jan 08 '19 at 18:33