The Apache Commons CSV project works quite well for parsing comma-separated values, tab-delimited data, and similar data formats.
My impression is that this tool reads a file entirely, keeping the resulting line objects in memory. But I am not sure; I cannot find any documentation regarding this behavior.
For parsing very large files, I would like to do an incremental read, one line at a time, or perhaps a relatively small number of lines at a time, to avoid exceeding memory limits.
With regard only to memory usage, the idea is similar to how a SAX parser for XML reads incrementally to minimize use of RAM, whereas a DOM-style XML parser reads a document entirely into memory to provide tree traversal.
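To make the distinction concrete, here is a minimal sketch in plain Java of the two reading styles I mean. It deliberately does not use Commons CSV and says nothing about how that library behaves; the class name `ReadStyles` and both method names are made up for illustration.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only: plain java.io, not the Commons CSV API.
public class ReadStyles {

    // Incremental ("SAX-like"): only the current line is referenced at any time.
    static long countIncrementally(Reader source) {
        long count = 0;
        try (BufferedReader reader = new BufferedReader(source)) {
            String line;
            while ((line = reader.readLine()) != null) {
                count++; // a real program would process 'line' here, then drop it
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return count;
    }

    // Entire-document ("DOM-like"): every line stays in the list until the caller is done.
    static List<String> readAll(Reader source) {
        List<String> lines = new ArrayList<>();
        try (BufferedReader reader = new BufferedReader(source)) {
            String line;
            while ((line = reader.readLine()) != null) {
                lines.add(line);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return lines;
    }

    public static void main(String[] args) {
        String data = "id,name\n1,Alice\n2,Bob\n";
        System.out.println(countIncrementally(new StringReader(data)));
        System.out.println(readAll(new StringReader(data)).size());
    }
}
```

With the first style, memory use stays roughly constant regardless of file size; with the second, it grows with the number of lines. I would like to know which of these Commons CSV corresponds to.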
Questions:
- What is the default behavior of Apache Commons CSV when reading documents: entirely into memory, or incrementally?
- Can this behavior be altered between incremental and entire-document?