2

I'm trying to read a very big Fortran unformatted binary file with python. This file contains 2^30 integers.

I find that the record markers is confusing (the first one is -2147483639), anyway I have achieved to recover the data structure ( those wanted integers are all similar, thus differ from record markers ) and write the code below ( with help of here ).

However, we can see the markers at the begin and the end of each record are not the same. Why is that?

Is it because the size of the data is too long ( 536870910 = (2^30 - 4) / 2 ) ? But ( 2^31 - 1 ) / 4 = 536870911 > 536870910.

Or just some mistakes made by the author of the data file?

Another question, what's the type of the marker at begin of a record , int or unsigned int?

fp = open(file_path, "rb")

rec_len1, = struct.unpack( '>i', fp.read(4) )
data1 = np.fromfile( fp, '>i', 536870910)
rec_end1, = struct.unpack( '>i', fp.read(4) )

rec_len2, = struct.unpack( '>i', fp.read(4) )
data2 = np.fromfile( fp, '>i', 536870910)
rec_end2, = struct.unpack( '>i', fp.read(4) )

rec_len3, = struct.unpack( '>i', fp.read(4) )
data3 = np.fromfile( fp, '>i', 4)
rec_end3, = struct.unpack( '>i', fp.read(4) )
data = np.concatenate([data1, data2, data3])

(rec_len1,rec_end1,rec_len2,rec_end2,rec_len3,rec_end3)

here's the values of record lenth readed as showed above:

(-2147483639, -2176, 2406, 589824, 1227787, -18)
Community
  • 1
  • 1
Syrtis Major
  • 3,791
  • 1
  • 30
  • 40
  • The format of an unformatted file is processor (Fortran compiler) specific, and in many cases specific to compiler options. You don't state what your Fortran compiler is. One convention that is common to a few compilers is documented [here](http://software.intel.com/sites/products/documentation/doclib/stdxe/2013/composerxe/compiler/fortran-win/GUID-E36C2463-1514-4E4E-B88A-769AB0326C57.htm) – IanH Mar 25 '13 at 08:21
  • Thanks a lot. I don't know which compiler was used, the file was generated in 2008 by someone. Maybe using f2py is better way? – Syrtis Major Mar 25 '13 at 11:16
  • We can use a fortran io programme compiled with ifort to access the data successfully. This means the file follows the conventions of intel? – Syrtis Major Mar 25 '13 at 11:27
  • [here](http://software.intel.com/sites/products/documentation/doclib/stdxe/2013/composerxe/compiler/fortran-win/index.htm#GUID-64D43E4C-68E7-4C48-8B50-B49F1F7DA46C.htm) said that Maximum Record Lengths of Variable-length Record is 2147483640 (2**31-8). That's probably why the first two records store 536870910 integers. – Syrtis Major Mar 25 '13 at 13:31
  • Are you sure its big endian? ( '>1' ) Are you sure each record contains 536870910 values, so that you can hard code it? One thing to try as a shot in the dark is assume your headers might be 8 bytes .. – agentp Mar 25 '13 at 20:12
  • Another thing to double-check is in the Fortran program that you use to access the data: is the file opened for sequential or direct access? If it's direct access, I've found that you don't have the record length markers at all which might explain these results. – Tim Whitcomb Mar 26 '13 at 16:24
  • one very simple thing to look at is the file size, since you know you have 2^30 integers in 3 records the header size is (file_size-4*2^30)/6 .. my modern gfortran produces 8 byte headers. Of course if you get zero you maybe have a direct acess file. – agentp Mar 26 '13 at 22:07
  • I'm quite sure for the big endian and the length of record (I have load every 4 byte data and compare them). And as george's suggestion, I have checked the size of the file(4294967320L), there's the exactly extra place of (3*2)*4=24 byte for record markers. Thanks a lot for your comments. – Syrtis Major Apr 01 '13 at 05:33
  • hmm, maybe unsigned short..? (try a capital >I in unpack) ..? – agentp Apr 01 '13 at 14:57
  • struct.unpack('>I',struct.pack('>i',-2147483639))[0]/4 = 536870914 ... – agentp Apr 01 '13 at 15:17
  • Now I know what happened. Length of the first record is 2147483639, the negative sign indicats the presence of a preceding subrecord. So the first record ends with the first 3 bytes of certain integer and the second record begins with the rest 1 byte. Length of the three records are 2147483639, 2147483639, 18 bytes, not nessary to be multiple of 4. – Syrtis Major Apr 24 '13 at 13:58
  • So, in my case the record marker is a signed short. ;) – Syrtis Major Apr 24 '13 at 14:06

3 Answers3

2

Finally, things seem to be more clear.

Here is a Intel Fortran Compiler User and Reference Guides, see the section Record Types:Variable-Length Records.

For a record length greater than 2,147,483,639 bytes, the record is divided into subrecords. The subrecord can be of any length from 1 to 2,147,483,639, inclusive.

The sign bit of the leading length field indicates whether the record is continued or not. The sign bit of the trailing length field indicates the presence of a preceding subrecord. The position of the sign bit is determined by the endian format of the file.

A subrecord that is continued has a leading length field with a sign bit value of 1. The last subrecord that makes up a record has a leading length field with a sign bit value of 0. A subrecord that has a preceding subrecord has a trailing length field with a sign bit value of 1. The first subrecord that makes up a record has a trailing length field with a sign bit value of 0. If the value of the sign bit is 1, the length of the record is stored in twos-complement notation.

After many essays, I realized that I was mislead by twos-complement notation, the record marker just change the sign according to the rules above, instead changing to its twos-complement notation when the sign bit is 1. Anyway it's also possible that my data was created with a diffrent compiler.

Below is the solution.

The data is larger than 2GB, so it's devided into several subrecords. As we see the first record start marker is -2147483639, so the lenth of the first record is 2147483639 which is exactly the maximum length of subrecord, not 2147483640 as I thought nor 2147483638 the twos-complement notation of -2147483639.

If we skip 2147483639 bytes to read the record end marker, you will get 2147483639, as it's the first subrecord whose end marker is positive.

Below is the code to check the record markers:

fp = open(file_path, "rb")
while 1:
    prefix, = struct.unpack( '>i', fp.read(4) )
    fp.seek(abs(prefix), 1)    #or read |prefix| bytes data as you want
    suffix, = struct.unpack( '>i', fp.read(4) )
    print prefix, suffix
    if abs(suffix) - abs(prefix): 
        print "suffix != prefix!"
        break
    if prefix > 0: break

And screen prints

-2147483639 2147483639
-2147483639 -2147483639
18 -18

We can see the record begin marker and end marker always are the same except the sign. Length of the three records are 2147483639, 2147483639, 18 bytes, not nessary to be multiple of 4. So the first record ends with the first 3 bytes of certain integer and the second record begins with the rest 1 byte.

Syrtis Major
  • 3,791
  • 1
  • 30
  • 40
0

I found that using f2py is a more convenient way for python to access fortran data. However, the strange behavior of the record marks remains a question. At least we can avoid diving into (sometimes confusing ) fortran unformatted file structure. And it matches well with numpy.

F2PY Users Guide and Reference Manual is here. Here's a example fortran source file for opening and closing file, reading integer 1-D array and float 2-D array. Note the comments begin with !f2py, they are helpful to make f2py more 'clever'.

To use it, you need wrap it into a module and import into python session. Then you can call these functions just as those python functions.

!ccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccc
!cc                                                         cc
!cc      FORTRAN MODULE for PYTHON PROGRAM CALLING          cc
!cc                                                         cc
!ccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccc

!Usage: 
!   Compile:   f2py -c fortio.f90 -m fortio
!   Import:    from fortio import *
!       or     import fortio
!Note:
!   Big endian: 1; Little endian: 0


!cccccccccccccccccccccccccccccccccccccccccccccccccccccccccccc
SUBROUTINE open_fortran_file(fileUnit, fileName, endian, error)
  implicit none

  character(len=256) :: fileName
  integer*4 :: fileUnit, error, endian
  !f2py integer*4 optional, intent(in) :: endian=1
  !f2py integer*4 intent(out) :: error

  if(endian .NE. 0) then
     open(unit=fileUnit, FILE=fileName, form='unformatted', status='old', &
          iostat=error, convert='big_endian')
  else
     open(unit=fileUnit, FILE=fileName, form='unformatted', status='old', &
          iostat=error)
  endif
END SUBROUTINE 

!cccccccccccccccccccccccccccccccccccccccccccccccccccccccccccc
SUBROUTINE read_fortran_integer4(fileUnit, arr, leng)
  implicit none

  integer*4 :: fileUnit, leng
  integer*4 :: arr(leng)
  !f2py integer*4 intent(in) :: fileUnit, leng 
  !f2py integer*4 intent(out), dimension(leng), depend(leng) :: arr(leng)

  read(fileUnit) arr
END SUBROUTINE

!cccccccccccccccccccccccccccccccccccccccccccccccccccccccccccc
SUBROUTINE read_fortran_real4(fileUnit, arr, row, col)
  implicit none

  integer*4 :: fileUnit, row, col
  real*4 :: arr(row,col)
  !f2py integer*4 intent(in):: fileUnit, row, col
  !f2py real*4 intent(out), dimension(row, col), depend(row, col) :: arr(row,col)

  read(fileUnit) arr
END SUBROUTINE

!cccccccccccccccccccccccccccccccccccccccccccccccccccccccccccc
SUBROUTINE close_fortran_file(fileUnit, error)
  implicit none

  integer*4 :: fileUnit, error
  !f2py integer*4 intent(in) :: fileUnit
  !f2py integer*4 intent(out) :: error

  close(fileUnit, iostat=error)
END SUBROUTINE 
Syrtis Major
  • 3,791
  • 1
  • 30
  • 40
0

Since this question seems to come up often.. this is a python utilty code to scan a binary file and determine if it is (might be) a fortran unformatted sequential access file. It works by trying several header formats. Of course since the "unformatted" format isn't standard there could be other varients but this should hit the most common ones.

note the left brackets are escaped so you might need to change & #060; back to a 'less than' sign if you screen copy this.


def scanfbinary(hformat,file,fsize):
 """ scan a file to see if it has the simple structure typical of
     an unformatted sequential access fortran binary:
     recl1,<data of length recl1 bytes>,recl1,recl2,<data of length recl2 bytes>,recl2 ...
 """
 import struct
 print 'scan type',hformat,
 if 'qQ'.find(hformat[1])>=0:  hsize=8
 elif 'iIlL'.find(hformat[1])>=0:  hsize=4
 if hformat[0] == '<':  endian='little'
 elif hformat[0] == '>':  endian='big'
 print '(',endian,'endian',hsize,'byte header)',
 f.seek(0)
 nrec = 0
 while fsize > 0:
  h0=struct.unpack(hformat,f.read(hsize))[0]
  if h0 < 0 :   print 'invalid integer ',h0; return 1
  if h0 > fsize - 2*hsize:
   print 'invalid header size ',h0,' exceeds file size ',fsize
   if nrec > 0:print 'odd perhaps a corrupe file?'
   return 2
# to read the data replace the next line with code to read h0 bytes..
# eg 
#  import numpy
#  dtype = numpy.dtype('<i')
#  record=numpy.fromfile(f,dtype,h0/dtype.itemsize) 
  f.seek(h0,1)   
  h=struct.unpack(hformat,f.read(hsize))[0]
  if h0!=h :  print 'unmatched header';   return 3
  nrec+=1
  if nrec == 1:print
  if nrec < 10:print 'read record',nrec,'size',h
  fsize-=(h+2*hsize)
 print 'successfully read ',nrec,' records with unformatted fortran header type',hformat
 return 0
f=open('binaryfilename','r')
f.seek(0,2)
fsize=f.tell()
res=[scanfbinary(hformat,f,fsize) for hformat in ('<q','>q','<i','>i')]
if res.count(0)==0:
 print 'no match found, file size ',fsize, 'starts..'
 f.seek(0)
 for i in range(0,12): print f.read(2).encode('hex_codec'),
 print 

agentp
  • 6,849
  • 2
  • 19
  • 37
  • Thanks for the answer. It will work for most case. However in my situation, the first record marker h0 is -2147483639. – Syrtis Major Mar 30 '13 at 08:38
  • 1
    my code tries to read in four different formats, no way it gives you that number every time. Try it and post the output if it doesn't work.. – agentp Mar 31 '13 at 11:52
  • 1
    `>>> res=[testfort.scanfbinary(hformat,f,fsize) for hformat in ('q','i')]` `scan type q ( big endian 8 byte header) invalid integer -9223371998178104305` `scan type i ( big endian 4 byte header) invalid integer -2147483639` – Syrtis Major Apr 01 '13 at 05:43
  • When the data written into a record is larger than 2GB, it will be devided into subrecords. The first subrecord length would be negative and one should use the absolute value of it. Your coed is pretty handy and thank you. – Syrtis Major Apr 24 '13 at 13:49