1

I'm new to C++ and I need to read a binary file.

I know the structure of the file, i.e., each "line" of the file consists of:

'double';'int8';'float32';'float32';'float32';'float32';'float32';'float32';'int8';'float32';'float32';'float32';'float32';'int8';'float32'

or in byte numbers:

8 1 4 4 4 4 4 4 1 4 4 4 4 1 4

I wrote some code, but it feels clumsy and outdated. Here it is:

void test1 () {
const char *filePath = "C:\\20110527_phantom19.elm2";  // the backslash must be escaped
double *doub;           
int *in;
float *fl;
FILE *file = NULL;     
unsigned char buffer;

if ((file = fopen(filePath, "rb")) == NULL) {
    cout << "Could not open specified file" << endl;
    return;  // nothing to read without an open file
}
cout << "File opened successfully" << endl;

// Get the size of the file in bytes
long fileSize = getFileSize(file);
cout << "File size: " << fileSize;
cout << "\n";
// Allocate space in the buffer for the whole file
doub = new double[1];
in = new int[1];
fl = new float[1];
// Read the file in to the buffer
//fread(fileBuf, fileSize, 1, file);

//fscanf(file, "%g %d %g", doub[0],in[0],fl[0]);

fread(doub, 8, 1, file);
//cout << doub[0]<< " ";
fseek (file ,8, SEEK_SET);
fread(&buffer,1,1,file);
//printf("%d ",buffer);
fread(fl,4,1,file);
//cout << fl[0]<< " ";
fread(fl,4,1,file);
//cout << fl[0]<< " ";
fread(fl,4,1,file);
//cout << fl[0]<< " ";
fread(fl,4,1,file);
//cout << fl[0]<< " ";
fread(fl,4,1,file);
//cout << fl[0]<< " ";
fread(fl,4,1,file);
//cout << fl[0]<< " ";
fread(&buffer,1,1,file);
//printf("%d ",buffer);
fread(fl,4,1,file);
//cout << fl[0]<< " ";
fread(fl,4,1,file);
//cout << fl[0]<< " ";
fread(fl,4,1,file);
//cout << fl[0]<< " ";
fread(fl,4,1,file);
//cout << fl[0]<< " ";
fread(&buffer,1,1,file);
//printf("%d ",buffer);
fread(fl,4,1,file);
//cout << fl[0]<< "\n";

cin.get();
//delete[]fileBuf;
fclose(file); 
}

How can I make this more efficient?

Shahbaz
luiserta
  • You say it's a binary file. How are the lines separated? – jrok Apr 16 '12 at 17:39
  • In fact, there are no lines in the file. The file itself consists of the same standard byte sequence repeated, i.e., 8 1 4 4 4 4 4 4 1 4 4 4 4 1 4 8 1 4 4 4 4 4 4 1 4 4 4 4 1 4 ... – luiserta Apr 17 '12 at 00:31

3 Answers

2

Why not read a whole struct that mirrors your record format, so that the fields are filled in with the correct values automatically?

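#include <cstdint>
#include <cstdio>

struct MyDataFormat {
  double d;
  int8_t i1;    // 'int8'
  float f[6];   // six 'float32' values
  // ... remaining fields, in file order
};

MyDataFormat buffer;

// Reads one whole record; this only matches the file if the struct has no padding.
fread(&buffer, sizeof(MyDataFormat), 1, file);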
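As a rough sketch of how the struct approach might be made to match the 55-byte layout from the question, the version below disables padding with `#pragma pack` (a compiler extension accepted by MSVC, GCC and Clang, not standard C++). The names `Record` and `readRecord` are made up for illustration, and the sketch assumes the file was written with the same byte order and floating-point format as the machine reading it.

```cpp
#include <cstdint>
#include <cstdio>

// Hypothetical record layout taken from the question:
// 8 + 1 + 6*4 + 1 + 4*4 + 1 + 4 = 55 bytes.
#pragma pack(push, 1)   // compiler extension: no padding between members
struct Record {
    double      d;
    std::int8_t i1;
    float       f1[6];
    std::int8_t i2;
    float       f2[4];
    std::int8_t i3;
    float       f3;
};
#pragma pack(pop)

// Fails to compile if the compiler still inserted padding.
static_assert(sizeof(Record) == 55, "unexpected padding in Record");

// Reads one record; returns false at end of file or on a short read.
bool readRecord(std::FILE* file, Record& out)
{
    return std::fread(&out, sizeof(Record), 1, file) == 1;
}
```

Calling `readRecord` in a `while` loop walks through the file one record at a time. If the byte order of the file differs from the host, this shortcut does not apply and the fields have to be decoded byte by byte instead.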
Jack
  • +1, he should probably use some "pack" directive, like `__attribute__((packed))`. – mfontanini Apr 16 '12 at 16:58
  • The problem is that this doesn't work. It might accidentally work for a given version of a given compiler, on a given platform, but it certainly doesn't work generally. – James Kanze Apr 16 '12 at 17:30
  • But, in this case, when you are using fread, you are reading only 1 element and this element, in this case, will have 55 bytes and the goal is reading 15 elements at the same time (55 bytes). – luiserta Apr 17 '12 at 14:37
  • Just do a `for (int i = 0; i < 15; ++i)` and do 15 `fread`s, where's the problem? – Jack Apr 17 '12 at 15:09
  • @JamesKanze According to what you need to do this can effectively be your best solution. I used it SO MANY times in years of development :) – Jack Apr 17 '12 at 15:11
  • Now I can read the elements, but the integer one does not appear. I declared it as `unsigned char i1;`. What is wrong? – luiserta Apr 17 '12 at 16:39
  • @Jack Except that it only works in very limited cases. It's a guaranteed problem down the road in most cases. (And it usually fails immediately on Intel machines, because the binary format happens to be standard Internet format, which is different from the internal format of Intel.) – James Kanze Apr 17 '12 at 16:44
1

If each line has the same format, I would probably read one line at a time into a buffer and then have a function that pulls that buffer apart into separate elements. That is easier to understand, easier to test, works with larger files, and is possibly more efficient because it does fewer reads.
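A minimal sketch of that idea, assuming the 55-byte record from the question and that the file uses the same byte order and floating-point format as the host. `Record`, `take`, `parseRecord` and `readRecord` are names invented here for illustration.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <cstring>

const std::size_t kRecordSize = 55;   // 8 + 1 + 6*4 + 1 + 4*4 + 1 + 4

struct Record {
    double      d;
    std::int8_t i1;
    float       f1[6];
    std::int8_t i2;
    float       f2[4];
    std::int8_t i3;
    float       f3;
};

// Copies sizeof(T) bytes from the buffer into `value` and advances the cursor.
template <typename T>
void take(const unsigned char*& p, T& value)
{
    std::memcpy(&value, p, sizeof(T));
    p += sizeof(T);
}

// Pulls one raw 55-byte block apart into its fields (host byte order assumed).
Record parseRecord(const unsigned char* block)
{
    Record r;
    const unsigned char* p = block;
    take(p, r.d);
    take(p, r.i1);
    take(p, r.f1);   // a whole array is copied in one go
    take(p, r.i2);
    take(p, r.f2);
    take(p, r.i3);
    take(p, r.f3);
    return r;
}

// Reads the next block from the file; returns false when no full block is left.
bool readRecord(std::FILE* file, Record& out)
{
    unsigned char block[kRecordSize];
    if (std::fread(block, 1, kRecordSize, file) != kRecordSize)
        return false;
    out = parseRecord(block);
    return true;
}
```

Because `parseRecord` only sees a byte buffer, it can be tested with hand-built blocks and no file at all, and the in-memory `Record` is free to have whatever padding the compiler likes.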

Martin Beckett
  • If the data has a binary format, as seems to be the case, `getline` isn't going to work. – James Kanze Apr 16 '12 at 17:31
  • @JamesKanze - reading a line doesn't mean getline(). Just that it's inefficient to read 1 byte then 4 bytes then ... but it's also bad to read an entire 4Gb file into memory that is then swapped out. Reading a block (ie line) at a time is in this case an easy solution. Best of all is mmap() but I didn't want to confuse things – Martin Beckett Apr 16 '12 at 17:43
  • Reading a fixed size block is good. Reading a line is bad, when the concept of line doesn't exist. (Of course, there's also nothing particularly inefficient about reading it byte by byte. Both `FILE*` and `std::istream` will take care of any buffering issues.) – James Kanze Apr 16 '12 at 17:53
  • @JamesKanze - the OP referred to it as a `line` so it was simpler to say `line` rather than record/block/data unit when answering their question. – Martin Beckett Apr 16 '12 at 17:59
1

In addition to the "structure" of the file, we need to know the format of the data types involved, and what you mean by "line", if the format isn't a text format. In general, however, you will have to read an appropriately sized block, and then extract each value from it, according to the specified format. For integral values, it's fairly easy to extract an unsigned integral value using shifts; for int8, in fact, you just have to read the byte. For most machines, just casting the unsigned integer into the correspondingly sized signed type will work, although this is explicitly not guaranteed; if the unsigned char is greater than SCHAR_MAX, you'll have to scale it down to get the appropriate value: something like -(UCHAR_MAX+1 - value) should do the trick (for chars; for larger types, you also have to worry about the fact that UINT_MAX+1 will overflow).
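For the int8 fields, the scaling described above can be sketched as follows. The function name `toInt8` is made up for illustration, and the sketch assumes the file stores the value in two's complement. It compares against `SCHAR_MAX` rather than `CHAR_MAX` so that it also behaves on platforms where plain `char` is unsigned.

```cpp
#include <climits>

// Converts a raw byte holding a two's-complement int8 into a signed value
// without relying on an implementation-defined narrowing conversion.
int toInt8(unsigned char value)
{
    if (value > SCHAR_MAX)
        return -static_cast<int>(UCHAR_MAX + 1 - value);
    return value;
}
```

The arithmetic is done in `int`, so `UCHAR_MAX + 1` cannot overflow here; that is the part that needs more care for 32-bit and 64-bit values.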

If the external format is IEEE, and that's also what your machine uses (the usual case for Windows and Unix machines, but rarely the case for mainframes), then you can read an unsigned 4 or 8 byte integer (again, using shifts), and type pun it, something like:

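#include <cstdint>

// Assembles 8 bytes, most significant byte first, into a 64-bit unsigned integer.
uint64_t
get64BitUInt( char const* buffer )
{
    // Mask each byte before shifting so that a negative char cannot sign-extend.
    return
          (static_cast<uint64_t>( buffer[0] & 0xFF ) << 56)
        | (static_cast<uint64_t>( buffer[1] & 0xFF ) << 48)
        | (static_cast<uint64_t>( buffer[2] & 0xFF ) << 40)
        | (static_cast<uint64_t>( buffer[3] & 0xFF ) << 32)
        | (static_cast<uint64_t>( buffer[4] & 0xFF ) << 24)
        | (static_cast<uint64_t>( buffer[5] & 0xFF ) << 16)
        | (static_cast<uint64_t>( buffer[6] & 0xFF ) <<  8)
        | (static_cast<uint64_t>( buffer[7] & 0xFF )      );
}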

double
getDouble( char const* buffer )
{
    uint64_t retval = get64BitUInt( buffer );
    return *reinterpret_cast<double*>( &retval );
}

(This corresponds to the usual network byte order. If your binary format uses another convention, you'll have to adapt it. And the reinterpret_cast depends on implementation-defined behavior; you may have to rewrite it as:

double
getDouble( char const* buffer )
{
    union
    {
        double          d;
        uint64_t        i;
    }               results;
    results.i = get64BitUInt( buffer );
    return results.d;
}

Or you can even use memcpy to copy from a uint64_t into a double.)
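The memcpy variant could look something like the sketch below. It assumes the file stores IEEE 754 doubles in big-endian (network) byte order and that the host also uses IEEE 754. Unlike the code above, it takes `unsigned char const*` so that no masking is needed; that is a choice made for this sketch.

```cpp
#include <cstdint>
#include <cstring>

// Assembles 8 big-endian bytes into a 64-bit unsigned integer.
std::uint64_t get64BitUInt(unsigned char const* buffer)
{
    std::uint64_t result = 0;
    for (int i = 0; i < 8; ++i)
        result = (result << 8) | buffer[i];
    return result;
}

// Reinterprets the 64-bit pattern as a double by copying its bytes.
double getDouble(unsigned char const* buffer)
{
    static_assert(sizeof(double) == sizeof(std::uint64_t),
                  "double must be 64 bits wide");
    const std::uint64_t bits = get64BitUInt(buffer);
    double value;
    std::memcpy(&value, &bits, sizeof value);
    return value;
}
```

Copying the bytes avoids both the pointer cast and the union, and compilers typically turn the `memcpy` into a plain register move.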

If your machine doesn't use IEEE floating point, and the external format is IEEE, you'll have to pick up the 8 byte word as an 8 byte unsigned int (unsigned long long), then extract the sign, exponent and mantissa according to the IEEE format; something like the following:

#include <cmath>

double
getDouble( char const* buffer )
{
    uint64_t            tmp( get64BitUInt( buffer ) );
    double              f = 0.0 ;
    // Handles zero and normalized values only (no denormals, infinities or NaNs).
    if ( (tmp & 0x7FFFFFFFFFFFFFFF) != 0 ) {
        f = ldexp( static_cast<double>( (tmp & 0x000FFFFFFFFFFFFF) | 0x0010000000000000 ),
                   (int)((tmp & 0x7FF0000000000000) >> 52) - 1022 - 53 ) ;
    }
    if ( (tmp & 0x8000000000000000) != 0 ) {
        f = -f ;
    }
    return f;
}

Don't do this until you're sure you'll need it, however.

James Kanze
  • In fact, there are no lines in the file. The file itself consists of the same standard byte sequence repeated, i.e., 8 1 4 4 4 4 4 4 1 4 4 4 4 1 4 8 1 4 4 4 4 4 4 1 4 4 4 4 1 4 ... – luiserta Apr 17 '12 at 00:47
  • @luiserta That's what I understood. It sounded like a binary format, and binary formats don't contain "lines" (which in C++ are defined as sequences of printable characters terminated by a `'\n'`). Do **not** use `getline` on this file; one of the bytes in a float or a double may look like a `'\n'`. Use `std::istream::read` for the exact number of bytes in the block, then parse as above. (And there's an error in my first version of `getDouble`, which I'll fix with an edit.) – James Kanze Apr 17 '12 at 08:04