I have a CSV file with about 2 million lines and about 150 columns of data. The total file size is about 1.3 GB, which works out to roughly 300 million array elements.

I started with a 3.5-million-line file and learned through trial and error that Fortran would not even compile unless the array was dimensioned at 3.9 million or less. At 4 million it was a no-go: bus errors and core dumps.

So I thought my 2-million-line file would work. I read a few posts about a 2 GB limit. However, when I print the line number while reading the data in, I only get to about 250,000 before the read just ends. Strangely enough, I have an almost identical file (made with the `split` command), and it only gets to 85,000 before conking out. I'm not sure why the two differ so much, since both have the same number of characters per line.

Is there anything I can do to get this data read in? It would be a major pain to compile all the data hundreds of times.

mrjimoy_05
  • On what system do you run? How do you compile your program? Please show relevant source code and compilation command. If on Linux, did you compile with `gfortran -Wall -g`, did you run under the `gdb` debugger? – Basile Starynkevitch Oct 24 '12 at 05:17
  • I'm sceptical of the need to read such a large file into memory in a single gulp; it's generally a much better strategy to read large data sets chunk-by-chunk: read a chunk, distil some data, discard the chunk, repeat. But if you do need to read large data sets, do yourself a favour and store them in a binary format, that is, `unformatted` in Fortran. – High Performance Mark Oct 24 '12 at 08:52
  • @HighPerformanceMark -- wasn't `stream` access added to the standard in one of the more recent revisions? That's another binary format which would take less disk space than `unformatted` and is still reasonably easy to comprehend... – mgilson Oct 24 '12 at 11:02
  • @mgilson: good point, though I defend my sloppiness by pointing out that `stream` access can be used on `formatted` and `unformatted` files so it's not, strictly, an alternative to either. – High Performance Mark Oct 24 '12 at 11:18
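
The chunk-by-chunk strategy suggested in the comments can be sketched as follows. This is only an illustration under stated assumptions: the file name `demo.csv`, the three-column layout, and the column-sum "distillation" are all hypothetical stand-ins for the real 150-column file and whatever statistic is actually needed. The program writes a tiny demo file first so that it runs on its own, then reads one row at a time and keeps only a running total, so memory use stays constant no matter how long the file is.

```fortran
program chunked_read
  use, intrinsic :: iso_fortran_env, only: int64, real64, iostat_end
  implicit none
  integer, parameter :: ncols = 3                      ! assumption: 150 in the real file
  character(len=*), parameter :: fname = 'demo.csv'    ! hypothetical file name
  real(real64)   :: row(ncols), col_sum(ncols)
  integer(int64) :: nrows
  integer        :: u, ios, i

  ! Write a small demo file so the sketch is self-contained.
  open(newunit=u, file=fname, status='replace', action='write')
  do i = 1, 5
     write(u, '(F6.1,",",F6.1,",",F6.1)') real(i, real64), 2.0_real64*i, 3.0_real64*i
  end do
  close(u)

  col_sum = 0.0_real64
  nrows   = 0_int64

  open(newunit=u, file=fname, status='old', action='read')
  do
     ! List-directed input accepts commas as value separators.
     read(u, *, iostat=ios) row
     if (ios == iostat_end) exit            ! normal end of file
     if (ios /= 0) then
        print *, 'read error after row', nrows, ' iostat =', ios
        error stop 1
     end if
     col_sum = col_sum + row                ! distil, then let the row go
     nrows   = nrows + 1_int64
  end do
  close(u, status='delete')                 ! remove the demo file

  print *, 'rows read  :', nrows
  print *, 'column sums:', col_sum
end program chunked_read
```

Checking `iostat` on every read also reports the row at which a read fails, rather than letting the loop end silently. Note that list-directed input continues onto the next record when a line has fewer values than expected, so a malformed line can shift the data; real input may need stricter parsing.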

1 Answer

This isn't a property of Fortran per se, but of your particular compiler and OS, which is why you should provide that information.

Regarding the bus error: the array is likely being placed on the stack, and you have run out of stack space. Most operating systems provide a way to increase the stack size, and many compilers have options that place large arrays on the heap instead. You can also declare the array `allocatable` and allocate it at run time. That last suggestion assumes a Fortran 90 or later compiler rather than a FORTRAN 77 one.
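
As a minimal sketch of the allocatable approach, assuming a Fortran 2003 or later compiler (for `errmsg=` and the `kind=` argument of `size`): the array name `table` and the sizes are illustrative only, scaled down from the question's 2,000,000 by 150 so the example runs on a small machine. Checking `stat=` turns an out-of-memory condition into a readable message instead of a crash.

```fortran
program allocate_on_heap
  use, intrinsic :: iso_fortran_env, only: real32, int64
  implicit none
  ! Assumed, scaled-down sizes; the question's data would be 2,000,000 x 150.
  integer, parameter :: nrows = 20000, ncols = 150
  real(real32), allocatable :: table(:,:)
  integer :: stat
  character(len=256) :: msg

  ! Fortran is column-major, so putting the fields first keeps each
  ! CSV row contiguous in memory when it is filled row by row.
  allocate(table(ncols, nrows), stat=stat, errmsg=msg)
  if (stat /= 0) then
     print *, 'allocation failed: ', trim(msg)
     error stop 1
  end if

  table = 0.0_real32
  table(ncols, nrows) = 1.0_real32          ! touch the last element

  print *, 'elements allocated:', size(table, kind=int64)
  print *, 'sum of elements   :', sum(table)

  deallocate(table)
end program allocate_on_heap
```

If the allocation itself fails for the full-size array, the limit is memory rather than stack space, and reading the file in pieces becomes the more practical route.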

The kind of the integer variable used for indexing also matters. If a loop counter or index in your program exceeds 2,147,483,647, the largest value a 4-byte integer can hold, you will need an integer kind larger than four bytes. We can only guess, since you don't show any of your source code.
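
To make the integer-size point concrete, here is a small sketch using the sizes from the question, and assuming 8-byte reals for the byte count (the question does not say which real kind is used). The element count, 300,000,000, still fits in a 4-byte integer, but the byte count, 2,400,000,000, does not, so any variable that holds a size in bytes or a file offset needs a 64-bit kind.

```fortran
program index_kinds
  use, intrinsic :: iso_fortran_env, only: int32, int64
  implicit none
  ! Sizes taken from the question; the 8-byte element size is an assumption.
  integer(int64), parameter :: nrows = 2000000_int64, ncols = 150_int64
  integer(int64), parameter :: bytes_per_real = 8_int64
  integer(int64) :: n_elements, n_bytes

  ! All operands are int64, so the products are computed in 64 bits.
  n_elements = nrows * ncols
  n_bytes    = n_elements * bytes_per_real

  print *, 'largest 4-byte integer:', huge(1_int32)
  print *, 'elements              :', n_elements
  print *, 'bytes                 :', n_bytes
  print *, 'elements fit in int32 :', n_elements <= huge(1_int32)
  print *, 'bytes fit in int32    :', n_bytes <= huge(1_int32)
end program index_kinds
```

The `_int64` suffix on the literals matters: with default-kind operands the multiplication can overflow before the result is assigned to the 64-bit variable.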

M. S. B.