27

I have a file with 15 million lines (will not fit in memory). I also have a small vector of line numbers - the lines that I want to extract.

How can I read out those lines in one pass?

I was hoping for a C function that does it in one pass.

Bhargav Rao
Aleksandr Levchuk

5 Answers

27

The trick is to use a connection AND open it before calling read.table:

con <- file('filename')
open(con)

read.table(con, skip=5, nrows=1)   # 6th line
read.table(con, skip=20, nrows=1)  # 27th line: skip counts from the current position (line 7)
...
close(con)

You may also try scan; it is faster and gives more control.
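As a rough sketch of the same idea wrapped in a function: the connection is opened once, and each wanted line is reached by consuming the gap since the previous one. The name `extract_lines` and the `chunk_size` parameter are made up for this illustration; it uses `readLines` rather than `read.table` or `scan`, and it discards the skipped lines in bounded chunks so a large gap never has to sit in memory at once.

```r
# Hypothetical helper (not from the answer): pull selected lines in one pass.
extract_lines <- function(path, line_numbers, chunk_size = 100000L) {
  wanted <- sort(unique(as.integer(line_numbers)))
  con <- file(path, open = "r")
  on.exit(close(con))
  result <- rep(NA_character_, length(wanted))
  position <- 0L  # number of lines consumed from the connection so far
  for (i in seq_along(wanted)) {
    gap <- wanted[i] - position - 1L
    # Discard the lines between the previous hit and the next wanted line.
    while (gap > 0L) {
      n <- min(gap, chunk_size)
      skipped <- readLines(con, n = n)
      if (length(skipped) < n) break  # reached end of file while skipping
      gap <- gap - n
    }
    line <- readLines(con, n = 1L)
    if (length(line) == 0L) break     # wanted line lies past the end of the file
    result[i] <- line
    position <- wanted[i]
  }
  names(result) <- wanted
  result
}
```

Line numbers are sorted first, so the order in which they are supplied does not matter; lines past the end of the file come back as NA.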

mbq
  • Definitely use `scan` or `readLines` for speed. `read.table` does a lot of checking of data types, dimensions, etc. Also would probably be best not to use `c` as a variable in R as it is one of the most commonly used functions (concatenate). – hatmatrix Aug 23 '11 at 11:03
  • So this will only read the file from disk once? How would read.table or scan behave if `skip=20` is called before `skip=5`? – Aleksandr Levchuk Aug 23 '11 at 19:14
  • @Aleksandr It is quite simple; first `read.table` will consume skip+nrow lines from the connection and the next `read.table` will start from this point. And so on. – mbq Aug 23 '11 at 20:20
5

If it's a binary file

Some discussion is here: Reading in only part of a Stata .DTA file in R

If it's a CSV or other text file

If the lines are contiguous and at the top of the file, just use the nrows argument to read.csv or any of the read.table family. If not, you can combine the nrows and skip arguments to call read.csv repeatedly (reading in a new row or group of contiguous rows with each call) and then rbind the results together.
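A minimal sketch of the repeated-call approach, assuming a CSV file without a header row and with consistent column types across rows. The function name `read_rows` is hypothetical. It keeps one open connection so that each `skip` is counted from the current position and the file is not re-read from the top on every call.

```r
# Hypothetical helper: read selected rows of a headerless CSV and rbind them.
read_rows <- function(path, rows, ...) {
  rows <- sort(unique(as.integer(rows)))
  con <- file(path, open = "r")
  on.exit(close(con))
  position <- 0L  # rows consumed from the connection so far
  pieces <- vector("list", length(rows))
  for (i in seq_along(rows)) {
    # skip is relative to the current position because the connection is open
    pieces[[i]] <- read.csv(con, header = FALSE,
                            skip = rows[i] - position - 1L, nrows = 1, ...)
    position <- rows[i]
  }
  do.call(rbind, pieces)
}
```

Because every piece is parsed on its own, a column whose type is guessed differently from row to row may be coerced by rbind; passing `colClasses` through `...` is one way to pin the types.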

Community
Ari B. Friedman
  • Is it possible to do it in one pass? – Aleksandr Levchuk Aug 23 '11 at 05:57
  • @Aleksandr Absolutely. But `read.table` doesn't appear to be written that way. If you look at the source for `read.table` it appears that it wouldn't be too hard to modify. Maybe others will have better answers on a pre-existing function that does this. – Ari B. Friedman Aug 23 '11 at 06:06
4

If your file has fixed line lengths then you can use 'seek' to jump to any byte position. So just jump to (N - 1) * line_length for each line N you want (numbering lines from 1 and counting the line terminator in line_length), and read one line.

However, from the R docs:

 Use of seek on Windows is discouraged.  We have found so many
 errors in the Windows implementation of file positioning that
 users are advised to use it only at their own risk, and asked not
 to waste the R developers' time with bug reports on Windows'
 deficiencies.

You can also use 'fseek' from the standard C library in C, but I don't know if the above warning also applies!
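A small sketch of the seek approach in R, assuming every line occupies exactly `line_length` bytes including its terminator. The function name `read_fixed_lines` is invented for this example, and the Windows caveat quoted above still applies.

```r
# Hypothetical helper: fetch lines from a file with fixed-length records.
read_fixed_lines <- function(path, n, line_length) {
  con <- file(path, open = "rb")  # binary mode so offsets are plain byte counts
  on.exit(close(con))
  vapply(n, function(k) {
    # Line k (1-based) starts at byte (k - 1) * line_length; use doubles to
    # avoid integer overflow on very large files.
    seek(con, where = as.numeric(k - 1) * line_length, origin = "start")
    bytes <- readBin(con, what = "raw", n = line_length)
    sub("[\r\n]+$", "", rawToChar(bytes))  # drop the line terminator
  }, character(1))
}
```

Unlike the skip-based approaches, the requested line numbers do not need to be sorted, since each read jumps straight to its offset.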

Spacedman
3

Before I got an R answer, I did it in Ruby:

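#!/usr/bin/env ruby

NUM_SEQS = 14024829

# Pick 10 random line numbers in 1..NUM_SEQS (f.lineno is 1-based).
linenumbers = (1..10).collect { rand(1..NUM_SEQS) }

# Stream the file once, printing only the wanted lines.
File.open("./data/uniprot_2011_02.tab") do |f|
  while line = f.gets
    print line if linenumbers.include? f.lineno
  end
end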

It runs fast (as fast as my storage can read the file).

Aleksandr Levchuk
  • I don't know why this was down-voted. Given the ease with which Ruby can be called from R, e.g., http://www.r-bloggers.com/calling-ruby-perl-or-python-from-r/ , and similarly the ease of calling R from Ruby, people shouldn't shy away from using the simplest solution available. – Carl Witthoft Aug 23 '11 at 11:24
  • Well, of course, you can use [http://beakernotebook.com] and do it in C++. But if you can find an equally good, or even a 10% slower native solution in R, I'd stick with that, unless you are writing throwaway code. Otherwise everyone who is reading the code has to understand basic Ruby. Otherwise you can't port the code to another system without worrying about compatibility. There may be security issues etc.etc.etc. Invoking another process, when it's not necessary, is generally not a good idea. I'd say, it would've been better to just add a comment to the question (+ a link) – Sergey Orshanskiy Oct 04 '15 at 16:55
  • Well, if you are asking for a solution in one language and provide the solution in another, consider at least providing extra code to call one language from another. – JelenaČuklina Nov 18 '15 at 10:26
2

I compiled a solution based on the discussions here.

scan(filename, what=list(NULL), sep='\n', blank.lines.skip=FALSE)

This only reports the number of lines; it stores none of the data. If you really want to skip the blank lines, you could just set the last argument to TRUE.

API