
I have 6 large TSV matrices of 60 GB (uncompressed), each containing 20 million rows x 501 columns: the first is an index/integer column that is basically the row number (so not even necessary), and the other 500 columns are numerical (floats with 4 decimals, e.g. 1.0301). All TSVs have the same number of rows, which correspond to each other.

I need to extract rows by row number.

What I need is to extract up to 5,000 contiguous rows or up to 500 non-contiguous rows; so not millions. Ideally there would also be some kind of compression to reduce the 60 GB size, so maybe no SQL? What would be the best way to do this?

  • One method I tried is to separate them into 100 gzipped files, index them using tabix, and then query them, but this is too slow for my needs (500 random rows took 90 seconds).

  • I read about the ff package, but I have not found how to index by the first column.

  • Are there other ways?

Thanks so much.

tafelplankje

1 Answer


I would use fread() from the data.table package.

Using the parameters skip and nrows, you can control the starting line to read (skip) and the number of rows to read (nrows).
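For example, to pull a contiguous block of 5,000 rows, something like this (a minimal sketch; matrix1.tsv is a placeholder filename, and I'm assuming the files have no header line):

    library(data.table)

    # Read rows 1,000,001 to 1,005,000 of one 60 GB TSV without
    # loading the rest of the file into memory.
    block <- fread("matrix1.tsv",
                   skip   = 1000000L,  # lines to skip before reading
                   nrows  = 5000L,     # rows to read from there
                   header = FALSE)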

If you want to explore the tidyverse approach, I recommend this solution: R: Read in random rows from file using fread or equivalent?
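For the non-contiguous case, here is a rough sketch in the same spirit (read_rows is a hypothetical helper of mine, with the same no-header and 1-based row number assumptions as above). Note that each fread() call still has to scan the file up to its skip point, so this mainly avoids memory problems rather than disk reads:

    library(data.table)

    # Read a set of 1-based row numbers, one fread() call per row,
    # and stack the single-row results into one data.table.
    read_rows <- function(path, rows) {
      rbindlist(lapply(sort(rows), function(r) {
        fread(path, skip = r - 1L, nrows = 1L, header = FALSE)
      }))
    }

    picked <- read_rows("matrix1.tsv", c(17L, 90210L, 19999999L))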

LocoGris
  • I doubt fread will be fast when querying by row, as it needs to read in the file structure every time, but I will give it a try, thanks. I am more looking for a memory-mapped file structure that indexes/queries by row number. – tafelplankje Feb 15 '19 at 22:18