
I have a very large multi-gigabyte file that is too costly to load into memory. The ordering of the rows in the file, however, is not random. Is there a way to read in a random subset of the rows using something like fread?

Something like this, for example?

data <- fread("data_file", nrows_sample = 90000)

This GitHub post suggests one possibility is to do something like this:

fread("shuf -n 5 data_file")

This does not work for me, however. Any ideas?
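
For reference, here is a minimal sketch of that shell-based approach, assuming a Unix-like system with shuf on the PATH and a data.table version recent enough to have fread's cmd argument. Since shuf would shuffle the header line in with the data, the header is split off first:

library(data.table)

# keep the header line, then sample 5 of the remaining lines
# (the brace group and shuf assume a POSIX shell with GNU coreutils)
sampled <- fread(cmd = "{ head -n 1 data_file; tail -n +2 data_file | shuf -n 5; }")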

Parseltongue

2 Answers


Using the tidyverse (as opposed to data.table), you could do:

library(readr)
library(purrr)
library(dplyr)

# generate some random numbers between 1 and however many rows your file has,
# assuming you can ballpark the number of rows in your file
#
# Generating 900 integers because we'll grab 10 rows for each start,
# giving us a total of 9000 rows in the final data frame
start_at <- floor(runif(900, min = 1, max = (n_rows_in_your_file - 10)))

# sort the index sequentially
start_at  <- start_at[order(start_at)]

# Read in 10 rows at a time, starting at your random numbers,
# binding results rowwise into a single data frame.
# col_names = FALSE stops read_csv from treating the first row of each
# chunk as a header, so every chunk binds with the same column names
sample_of_rows <- map_dfr(start_at, ~read_csv("data_file", n_max = 10, skip = .x, col_names = FALSE))
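
Since the question asks about fread specifically, here is a hedged sketch of the same chunked idea using data.table instead of readr; it assumes the same n_rows_in_your_file ballpark and a file with a header row:

library(data.table)

# same idea: random start positions, sorted so the file is read front to back
start_at <- sort(floor(runif(900, min = 1, max = n_rows_in_your_file - 10)))

# skip = i skips the header plus the first i - 1 data rows; header = FALSE
# keeps fread from treating the first row of each chunk as column names
chunks <- lapply(start_at, function(i) fread("data_file", skip = i, nrows = 10, header = FALSE))

sample_of_rows <- rbindlist(chunks)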
crazybilly

If your data file happens to be a text file, this solution using the LaF package could be useful:

library(LaF)

# Prepare dummy data
mat <- matrix(sample(letters, 10 * 1000000, replace = TRUE), nrow = 1000000)

dim(mat)
#[1] 1000000      10

write.table(mat, "tmp.csv",
    row.names = FALSE,
    sep = ",",
    quote = FALSE)

# Read 90,000 random lines
start <- Sys.time()
random_mat <- sample_lines(filename = "tmp.csv",
    n = 90000,
    nlines = 1000000)
random_mat <- do.call("rbind", strsplit(random_mat, ","))
Sys.time() - start
#Time difference of 1.135546 secs

dim(random_mat)
#[1] 90000    10
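
If you would rather end up with a data frame than a character matrix, a hedged alternative for the parsing step (assuming a data.table version recent enough to have fread's text argument) is to hand the sampled lines straight back to fread:

library(data.table)

# sample the raw lines as before, then let fread do the splitting and type guessing
lines <- sample_lines(filename = "tmp.csv", n = 90000, nlines = 1000000)
random_dt <- fread(text = lines, header = FALSE)

# note: as with the code above, the sample can occasionally pick up the
# file's header line as an ordinary data row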
tobiasegli_te