Trick to read data from hard drive faster between successive compilations

Question

I am developing code with a compiled language (Fortran 95) that does certain calculations on a huge galaxy catalog. Each time I implement some change, I compile and run the code, and it takes about 3 minutes just reading the ASCII file with the galaxy data from disk. This is a waste of time.

Had I started this project in IDL or Matlab, then it would be different, because the variables containing the array data would be kept in memory between different compilations.

However, I think something could be done to speed up that unnerving reading from disk, like having the files in a fake RAM partition or something.

I changed the title slightly to better match the answer I accepted. That answer does not fit my original question 100% (no clues on how to set up a partition in RAM), but it is too useful not to award it the green tag. I will post another question about a RAM directory specifically. — Mephisto, Feb 22 '17 at 21:00
Have you made some actual measurements, or are you just guessing that the bottleneck is the disk I/O rather than for example parsing the ASCII data format? Assuming you have enough RAM for a ramdisk, you probably also have plenty available for a disk cache, so there's a good chance you're already reading mostly from memory anyway. The accepted answer hints at this being the case. — Dan Mašek, Feb 22 '17 at 21:02
@DanMašek I don't know the specifics of what you propose, but I think the first run after logging in takes the same time as subsequent runs. Also, while the file is being read, the top command does not show 100% CPU activity, so the issue is probably in reading from the hard drive. — Mephisto, Feb 22 '17 at 21:13

score 6 · Accepted Answer · answered Feb 20 '17 at 20:07

Instead of going into details on RAM disks, I propose you switch from ASCII databases to binary ones. Here is a very simplistic example: an array of random numbers, stored as ASCII (ASCII.txt) and as binary data (binary.bin):

program writeArr
  use,intrinsic :: ISO_Fortran_env, only: REAL64
  implicit none
  real(REAL64),allocatable :: tmp(:,:)
  integer :: uFile, i

  allocate( tmp(10000,10000) )
  call random_number(tmp)   ! fill the array with random numbers

  ! Formatted write
  open(newunit=uFile, file='ASCII.txt',form='formatted', &
       status='replace',action='write')
  do i=1,size(tmp,2)   ! loop over columns
    write(uFile,*) tmp(:,i)
  enddo !i
  close(uFile)

  ! Unformatted write
  open(newunit=uFile, file='binary.bin',form='unformatted', &
       status='replace',action='write')
  write(uFile) tmp
  close(uFile)

end program

Here is the result in terms of sizes:

 :> ls -lah ASCII.txt binary.bin 
-rw-rw-r--. 1 elias elias 2.5G Feb 20 20:59 ASCII.txt
-rw-rw-r--. 1 elias elias 763M Feb 20 20:59 binary.bin

So, you save a factor of ~3.35 in terms of storage. Now comes the fun part: reading it back in...

program readArr
  use,intrinsic :: ISO_Fortran_env, only: REAL64
  implicit none
  real(REAL64),allocatable :: tmp(:,:)
  integer :: uFile, i
  integer :: count_rate, iTime1, iTime2

  allocate( tmp(10000,10000) )

  ! Get the count rate
  call system_clock(count_rate=count_rate)

  ! Formatted read
  open(newunit=uFile, file='ASCII.txt',form='formatted', &
       status='old',action='read')

  call system_clock(iTime1)
  do i=1,size(tmp,2)   ! loop over columns
    read(uFile,*) tmp(:,i)
  enddo !i
  call system_clock(iTime2)
  close(uFile)
  print *,'ASCII  read ',real(iTime2-iTime1,REAL64)/real(count_rate,REAL64)

  ! Unformatted read
  open(newunit=uFile, file='binary.bin',form='unformatted', &
       status='old',action='read')
       status='old',action='read')
  call system_clock(iTime1)
  read(uFile) tmp
  call system_clock(iTime2)
  close(uFile)
  print *,'Binary read ',real(iTime2-iTime1,REAL64)/real(count_rate,REAL64)

end program

The result is

 ASCII  read    37.250999999999998     
 Binary read    1.5460000000000000

So, a factor of >24!

So instead of thinking of anything else, please switch to a binary file format first.
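
In a compile-and-run development cycle, one way to apply this is to parse the ASCII file only once and keep a binary copy for later runs. The following is a minimal sketch of that idea, not a definitive implementation. It assumes the files ASCII.txt and binary.bin and the 10000 x 10000 shape from the example above; the program name cachedRead and the variable haveCache are hypothetical.

```fortran
program cachedRead
  use,intrinsic :: ISO_Fortran_env, only: REAL64
  implicit none
  real(REAL64),allocatable :: tmp(:,:)
  integer :: uFile, i
  logical :: haveCache

  ! Assumed shape, matching the example files above
  allocate( tmp(10000,10000) )

  ! Is there a binary copy left over from an earlier run?
  inquire(file='binary.bin', exist=haveCache)

  if (haveCache) then
    ! Fast path: a single unformatted read of the whole array
    open(newunit=uFile, file='binary.bin',form='unformatted', &
         status='old',action='read')
    read(uFile) tmp
    close(uFile)
  else
    ! Slow path, taken once: parse the ASCII file ...
    open(newunit=uFile, file='ASCII.txt',form='formatted', &
         status='old',action='read')
    do i=1,size(tmp,2)
      read(uFile,*) tmp(:,i)
    enddo !i
    close(uFile)

    ! ... and store a binary copy for the following runs
    open(newunit=uFile, file='binary.bin',form='unformatted', &
         status='replace',action='write')
    write(uFile) tmp
    close(uFile)
  endif

  print *,'Loaded array, first element: ', tmp(1,1)
end program
```

If the ASCII file changes, binary.bin has to be deleted by hand so that it is regenerated; this sketch does not detect a stale copy.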

Of course, there are many binary formats out there more suitable to complex data, and much more portable than simple Fortran binary files. Amongst those I would suggest you consider HDF5. — Alexander Vogt, Feb 20 '17 at 20:16
This is an awesome answer that will be useful for many people (+1), but the original file has a mix of types (1st column is character, 2nd is an integer, other columns are floating point...) and I defined a derived type. I am not sure how to handle that when writing a binary file. I also have other codes where I switched to binary to be able to store and read 16000x16000 double precision arrays in a reasonable amount of time. — Mephisto, Feb 20 '17 at 23:06
However you were reading the ASCII file into a derived type will likely work for an unformatted file. For example, read the integer components, then the real components, and so on. If you have trouble, ask a different question. — Ross, Feb 20 '17 at 23:46
@Mephisto The structure can be written out just fine if the structure is correct. So try it on a small fragment. Obviously you will need binary-to-ASCII and ASCII-to-binary converters to interface with the rest of your community, but that should be easy enough. — Holmz, Feb 21 '17 at 23:21
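
For the mixed-type case raised in the comments, here is a minimal sketch of writing and reading an array of a derived type with unformatted I/O. The type galaxy, its components, and the file name catalog.bin are hypothetical stand-ins, since the actual catalog layout is not given. It relies on the type having no allocatable or pointer components, in which case the whole array can go into a single record.

```fortran
program derivedTypeIO
  use,intrinsic :: ISO_Fortran_env, only: REAL64
  implicit none

  ! Hypothetical record layout: a name, an integer ID, three reals
  type :: galaxy
    character(len=16) :: name
    integer           :: id
    real(REAL64)      :: pos(3)
  end type galaxy

  type(galaxy),allocatable :: cat(:), chk(:)
  integer :: uFile, i, n

  ! Fill a small test catalog
  allocate( cat(1000) )
  do i=1,size(cat)
    write(cat(i)%name,'(a,i0)') 'gal', i
    cat(i)%id  = i
    cat(i)%pos = real(i,REAL64) * [1.0_REAL64, 2.0_REAL64, 3.0_REAL64]
  enddo !i

  ! Write the record count first, then the whole array as one record
  open(newunit=uFile, file='catalog.bin',form='unformatted', &
       status='replace',action='write')
  write(uFile) size(cat)
  write(uFile) cat
  close(uFile)

  ! Read it back: count, allocate, then the array
  open(newunit=uFile, file='catalog.bin',form='unformatted', &
       status='old',action='read')
  read(uFile) n
  allocate( chk(n) )
  read(uFile) chk
  close(uFile)

  ! Check that the round trip reproduced every component
  if (n /= size(cat)) error stop 'count mismatch'
  if (any(chk%id /= cat%id)) error stop 'id mismatch'
  if (any(chk%name /= cat%name)) error stop 'name mismatch'
  do i=1,n
    if (any(chk(i)%pos /= cat(i)%pos)) error stop 'pos mismatch'
  enddo !i
  print *,'Round trip OK, records: ', n
end program
```

Storing the count as its own record lets the reader allocate the array before reading it. Such files are generally tied to the compiler and platform that wrote them, which is one reason a format like HDF5 was suggested above.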
