
I'm trying to parse a very large file using FParsec. The file's size is 61GB, which is too big to hold in RAM, so I'd like to generate a sequence of results (i.e. seq<'Result>), rather than a list, if possible. Can this be done with FParsec? (I've come up with a jerry-rigged implementation that actually does this, but it doesn't work well in practice due to the O(n) performance of CharStream.Seek.)

The file is line-oriented (one record per line), which should, in theory, make it possible to parse it in batches of, say, 1000 records at a time. The FParsec "Tips and tricks" section says:

If you’re dealing with large input files or very slow parsers, it might also be worth trying to parse multiple sections within a single file in parallel. For this to be efficient there must be a fast way to find the start and end points of such sections. For example, if you are parsing a large serialized data structure, the format might allow you to easily skip over segments within the file, so that you can chop up the input into multiple independent parts that can be parsed in parallel. Another example could be a programming language whose grammar makes it easy to skip over a complete class or function definition, e.g. by finding the closing brace or by interpreting the indentation. In this case it might be worth not to parse the definitions directly when they are encountered, but instead to skip over them, push their text content into a queue and then to process that queue in parallel.

This sounds perfect for me: I'd like to pre-parse each batch of records into a queue, and then finish parsing them in parallel later. However, I don't know how to accomplish this with the FParsec API. How can I create such a queue without using up all my RAM?
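To make that concrete, here is roughly the pipeline I'm picturing: split the input into batches of raw record lines, treat the lazily produced batches as the queue, and finish parsing each batch in parallel. This is only a sketch: pRecord is a stand-in for my real record parser, and the header filter just reflects the VCF format, whose header lines start with "#".

    open System.IO
    open FParsec

    // Stand-in for my real record parser (hypothetical); it just splits a line on tabs.
    let pRecord : Parser<string list, unit> =
        sepBy (manyChars (noneOf "\t\n")) (pchar '\t')

    let parseLine line =
        match runParserOnString pRecord () "record" line with
        | Success (record, _, _) -> Some record
        | Failure (msg, _, _)    -> eprintfn "%s" msg; None

    // Lazily read the file, skip header lines, batch the raw text, and parse
    // each batch in parallel. Only one batch of ~1000 lines (plus its parsed
    // results) needs to be in memory at a time.
    let parseFile path : seq<string list> =
        File.ReadLines path
        |> Seq.filter (fun line -> not (line.StartsWith "#"))  // VCF header lines start with '#'
        |> Seq.chunkBySize 1000
        |> Seq.collect (Array.Parallel.choose parseLine)

What I can't figure out is whether FParsec can drive something like this directly from a single CharStream over the whole 61GB file, or whether I have to fall back on splitting the text into per-line strings myself first, as above.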

FWIW, the file I'm trying to parse is here if anyone wants to give it a try with me. :)

Brian Berns
  • Can you give some records as an example? – Just another metaprogrammer May 11 '15 at 21:33
  • Each record in that file is about 10K characters long, so I can't really paste one here, but the file format spec has a good small example: http://samtools.github.io/hts-specs/VCFv4.2.pdf. The records I'm parsing are the last 5 in that example - the ones that start with "20". – Brian Berns May 11 '15 at 21:42
  • Just to clarify: I know how to parse the records, and my parser works fine for most files. I'm just having trouble scaling it up to gigantic inputs. – Brian Berns May 11 '15 at 22:28
  • It sounds like each record is self-contained, correct? That is, you don't need information from some past (or future) record to fully parse an individual record. If that's the case, why not just read lines as a seq{} and Seq.iter parseRecord? Let F#/CLR worry about batching/buffering and just focus on line-oriented record parsing. Once you're done with a record the garbage collector should handle it. My sense is you should be able to process extremely large files this way with a minimal memory footprint. – Robert Sim May 12 '15 at 06:36
  • I like that idea. Will give it a try. – Brian Berns May 12 '15 at 13:27

1 Answer


The "obvious" thing that comes to mind would be to pre-process the file using something like File.ReadLines and then parse one line at a time.
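For instance, something along these lines (just a sketch; parseRecord is assumed to wrap your existing FParsec line parser so that it returns a plain value):

    open System.IO

    // parseRecord : string -> 'Record is assumed to wrap the FParsec parser,
    // e.g. by calling runParserOnString on one line and unpacking the result.
    let parseFile (parseRecord : string -> 'Record) path : seq<'Record> =
        File.ReadLines path      // streams the file lazily, line by line
        |> Seq.map parseRecord   // so the whole 61GB never has to be in memory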

If this doesn't work (from your PDF it looked like a record can span a few lines), then you can build a seq of records, or of batches of 1000 records or so, using normal FileStream reading. This doesn't need to know the details of a record, but it does help if you can at least delimit the records.
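For example, a lazy record reader could look roughly like this. It's only a sketch, and it assumes, purely for illustration, that a blank line ends a record; substitute whatever delimiter your format actually provides.

    open System.IO
    open System.Text

    // Lazily yield the raw text of one record at a time, where a record may
    // span several lines. A blank line is assumed (for illustration) to end a record.
    let readRecords path = seq {
        use reader = new StreamReader(File.OpenRead path)
        let sb = StringBuilder()
        while not reader.EndOfStream do
            let line = reader.ReadLine()
            if line <> "" then
                sb.AppendLine(line) |> ignore   // accumulate lines of the current record
            elif sb.Length > 0 then
                yield sb.ToString()             // blank line: emit the finished record
                sb.Clear() |> ignore
        if sb.Length > 0 then yield sb.ToString() }

Each yielded string can then be handed to the record parser (e.g. with runParserOnString), and the resulting seq of record strings can be batched with something like Seq.chunkBySize if you want to parse them in parallel.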

Either way, you end up with a lazy seq that the parser can then read.

Daniel Fabian