7

Basic question, but I expected this struct to occupy 13 bytes of space (1 for the char, 12 for the 3 unsigned ints). Instead, sizeof(ESPR_REL_HEADER) gives me 16 bytes.

typedef struct {
  unsigned char version;
  unsigned int  root_node_num;
  unsigned int  node_size;
  unsigned int  node_count;
} ESPR_REL_HEADER;

What I'm trying to do is initialize this struct with some values and write the data it contains (the raw bytes) to the start of a file, so that when I open this file later I can reconstruct this struct and gain some metadata about what the rest of the file contains.

I'm initializing the struct and writing it to the file like this:

int esprime_write_btree_header(FILE * fp, unsigned int node_size) {
  ESPR_REL_HEADER header = {
    .version       = 1,
    .root_node_num = 0,
    .node_size     = node_size,
    .node_count    = 1
  };

  return fwrite(&header, sizeof(ESPR_REL_HEADER), 1, fp);
}

Where node_size is currently 4 while I experiment.

The file contains the following data after I write the struct to it:

-bash$  hexdump test.dat
0000000 01 bf f9 8b 00 00 00 00 04 00 00 00 01 00 00 00
0000010

I expect it to actually contain:

-bash$  hexdump test.dat
0000000 01 00 00 00 00 04 00 00 00 01 00 00 00
000000d

Excuse the newbiness. I am trying to learn :) How do I efficiently write just the data components of my struct to a file?

d11wtq

8 Answers

6

Many microprocessors cannot fetch multi-byte data from arbitrary addresses, or can do so only slowly. Objects such as 4-byte ints should therefore be stored at addresses divisible by four. This requirement is called alignment.

C gives the compiler freedom to insert padding bytes between struct members to align them. The amount of padding is just one variable between different platforms, another major variable being endianness. This is why you should not simply "dump" structures to disk if you want the program to run on more than one machine.
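
To see where the padding ends up on a given platform, the layout can be inspected with sizeof and offsetof. The following is only a sketch: the helper names espr_padding_after_version and espr_print_layout are made up for illustration, and the numbers it reports depend on the compiler and target. On the asker's machine it would presumably report root_node_num at offset 4 and a total size of 16.

```c
#include <stddef.h>  /* offsetof, size_t */
#include <stdio.h>

typedef struct {
  unsigned char version;
  unsigned int  root_node_num;
  unsigned int  node_size;
  unsigned int  node_count;
} ESPR_REL_HEADER;

/* Hypothetical helper: number of padding bytes the compiler inserted
   between version and root_node_num on this platform. */
size_t espr_padding_after_version(void) {
  return offsetof(ESPR_REL_HEADER, root_node_num)
       - (offsetof(ESPR_REL_HEADER, version) + sizeof(unsigned char));
}

/* Hypothetical helper: print the size of the struct and the offset of
   each member, so the padding becomes visible. */
void espr_print_layout(void) {
  printf("sizeof(ESPR_REL_HEADER) = %zu\n", sizeof(ESPR_REL_HEADER));
  printf("version       at offset %zu\n", offsetof(ESPR_REL_HEADER, version));
  printf("root_node_num at offset %zu\n", offsetof(ESPR_REL_HEADER, root_node_num));
  printf("node_size     at offset %zu\n", offsetof(ESPR_REL_HEADER, node_size));
  printf("node_count    at offset %zu\n", offsetof(ESPR_REL_HEADER, node_count));
  printf("padding after version   = %zu\n", espr_padding_after_version());
}
```

Calling espr_print_layout() once from main is enough to compare two platforms or compilers.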

The best practice is to write each member explicitly and to use htonl to convert each value to big-endian before binary output. When reading back, use memcpy to move the raw bytes. Do not use

char *buffer_ptr;
...
++ buffer_ptr;
s.member = * (int *) buffer_ptr; /* potential alignment error */

but instead do

memcpy( & s.member, buffer_ptr, sizeof s.member ); /* destination first, then source */
s.member = ntohl( s.member ); /* if member is 4 bytes */
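
As a rough sketch of that pattern, the two helpers below write and read one 32-bit value at an arbitrary, possibly unaligned, position in a byte buffer. The names put_u32_be and get_u32_be are hypothetical, and the code assumes a POSIX system where htonl and ntohl come from <arpa/inet.h>.

```c
#include <arpa/inet.h>  /* htonl, ntohl (POSIX) */
#include <stdint.h>
#include <string.h>     /* memcpy */

/* Hypothetical helper: store value at buf in big-endian (network) order
   and return the position just past the bytes written. */
unsigned char *put_u32_be(unsigned char *buf, uint32_t value) {
  uint32_t wire = htonl(value);     /* host order -> big-endian */
  memcpy(buf, &wire, sizeof wire);  /* safe even if buf is unaligned */
  return buf + sizeof wire;
}

/* Hypothetical helper: load a big-endian 32-bit value from buf,
   which does not need to be aligned. */
uint32_t get_u32_be(const unsigned char *buf) {
  uint32_t wire;
  memcpy(&wire, buf, sizeof wire);  /* copy raw bytes, no pointer cast */
  return ntohl(wire);               /* big-endian -> host order */
}
```

With this approach the value 4 is stored as the bytes 00 00 00 04 on every platform, which is big-endian and therefore not the byte order shown in the question's hexdump.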
Potatoswatter
  • Thanks for that. So basically does it come down to manually building a byte array and writing that to disk, then when I read it back off disk, copying the bytes from that array back into the members of a newly allocated struct? I'm just learning really, but I would like to do this in a way that will mean the file is always guaranteed to have the same format across machines, yes. – d11wtq Apr 14 '12 at 11:26
  • 1
    @d11wtq Yep, for best portability you should use `memcpy` to copy the bytes from the array to the member and then call `ntohl` (or whatever is appropriate) to fix the byte order. – Potatoswatter Apr 14 '12 at 11:28
  • Excellent, thanks. I have some reading to do. It's hard to be newbie :) – d11wtq Apr 14 '12 at 11:33
3

That is because of structure padding, see http://en.wikipedia.org/wiki/Sizeof#Implementation

Vincenzo Pii
1

When you write structures as is with fwrite, they get written as they are in memory, including the "dead bytes" inside the struct that are inserted due to the padding. Additionally, your multi-byte data is written with the endianness of your system.

If you do not want that to happen, write a function that serializes the data from your structure. You can write only the non-padded areas, and also write multibyte data in a predictable order (e.g. in the network byte order).
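
One possible shape for such a serializing function is sketched below. It packs the header into exactly 13 bytes using shifts, so the result does not depend on the host's padding or endianness. The names store_u32_le, espr_header_encode, espr_header_write and ESPR_HEADER_BYTES are invented for this example. Little-endian order is chosen here only because it matches the dump the question expects; network byte order would work just as well. The sketch also assumes every field value fits in 32 bits.

```c
#include <stdint.h>
#include <stdio.h>

typedef struct {
  unsigned char version;
  unsigned int  root_node_num;
  unsigned int  node_size;
  unsigned int  node_count;
} ESPR_REL_HEADER;

/* Assumed on-disk size: 1 byte + 3 * 4 bytes, no padding. */
enum { ESPR_HEADER_BYTES = 13 };

/* Hypothetical helper: store v as 4 bytes, least significant byte first. */
void store_u32_le(unsigned char *out, uint32_t v) {
  out[0] = (unsigned char)(v & 0xFF);
  out[1] = (unsigned char)((v >> 8) & 0xFF);
  out[2] = (unsigned char)((v >> 16) & 0xFF);
  out[3] = (unsigned char)((v >> 24) & 0xFF);
}

/* Hypothetical helper: copy only the data members into out,
   skipping whatever padding the struct has in memory. */
void espr_header_encode(const ESPR_REL_HEADER *h,
                        unsigned char out[ESPR_HEADER_BYTES]) {
  out[0] = h->version;
  store_u32_le(out + 1, h->root_node_num);
  store_u32_le(out + 5, h->node_size);
  store_u32_le(out + 9, h->node_count);
}

/* Hypothetical helper: encode, then write all 13 bytes with one fwrite.
   Returns 0 on success and -1 on failure. */
int espr_header_write(FILE *fp, const ESPR_REL_HEADER *h) {
  unsigned char bytes[ESPR_HEADER_BYTES];
  espr_header_encode(h, bytes);
  return fwrite(bytes, sizeof bytes, 1, fp) == 1 ? 0 : -1;
}
```

Because the buffer is filled in memory first, the header still reaches the file through a single fwrite call.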

Sergey Kalinichenko
1

The struct is subject to alignment rules, which means some items in it get padded. Looking at it, it looks like the first unsigned char field has been padded to 4 bytes.

One of the gotchas here is that the rules can be different from system to system, so if you write the struct as a whole using fwrite in a program compiled with one compiler on one platform, and then try to read it using fread on another, you could get garbage because the second program will assume the data is aligned to fit its conception of the struct layout.

Generally, you have to either:

  1. Decide that saved data files are only valid for builds of your program that share certain characteristics (depending on the documented behaviour of the compiler you used), or

  2. Not write a whole structure as one, but implement a more formal data format where each element is written individually with its size explicitly controlled.

(A related issue is that byte order could be different; the same choice generally applies there too, except that in option 2 you want to explicitly specify the byte order of the data format.)
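
For option 2, the reading side might look like the sketch below. It assumes a hypothetical on-disk format of exactly 13 bytes: one version byte followed by three 32-bit little-endian integers. The type espr_header_fixed and the functions load_u32_le, espr_header_decode and espr_header_read are illustrative names, not part of any existing code.

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical in-memory form using fixed-width types, so each
   member has the same size on every platform. */
typedef struct {
  uint8_t  version;
  uint32_t root_node_num;
  uint32_t node_size;
  uint32_t node_count;
} espr_header_fixed;

/* Assumed size of the header in the file format. */
enum { ESPR_HEADER_DISK_BYTES = 13 };

/* Hypothetical helper: assemble a 32-bit value from 4 bytes stored
   least significant byte first, independent of host endianness. */
uint32_t load_u32_le(const unsigned char *in) {
  return (uint32_t)in[0]
       | ((uint32_t)in[1] << 8)
       | ((uint32_t)in[2] << 16)
       | ((uint32_t)in[3] << 24);
}

/* Hypothetical helper: fill the struct member by member from the
   raw bytes of the file format. */
void espr_header_decode(const unsigned char in[ESPR_HEADER_DISK_BYTES],
                        espr_header_fixed *h) {
  h->version       = in[0];
  h->root_node_num = load_u32_le(in + 1);
  h->node_size     = load_u32_le(in + 5);
  h->node_count    = load_u32_le(in + 9);
}

/* Hypothetical helper: read the 13 header bytes with one fread and
   decode them. Returns 0 on success and -1 on a short read. */
int espr_header_read(FILE *fp, espr_header_fixed *h) {
  unsigned char bytes[ESPR_HEADER_DISK_BYTES];
  if (fread(bytes, sizeof bytes, 1, fp) != 1) {
    return -1;
  }
  espr_header_decode(bytes, h);
  return 0;
}
```

The struct in memory may still contain padding; that no longer matters because the file layout is defined by the decode function, not by the compiler.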

Edmund
  • Is there a good pattern to follow for point (2)? I'm trying to minimize disk I/O in everything I do here (not premature optimization; this is actually the point of the exercise... I'm exploring tree algorithms for storing data sets on disk with low I/O overhead, just for fun). Writing four times would be inefficient, so I assume I'm supposed to copy the data into another data structure in C before I write it? Like an array of `unsigned char` types? – d11wtq Apr 14 '12 at 11:30
  • The writes will often be buffered (resulting in fewer actual calls to the OS to actually write stuff), so it might not be as expensive as you think. You could write into a larger buffer that corresponds to your data format, then `fwrite` that in one chunk. That's probably easier if your data is a fixed size. – Edmund Apr 15 '12 at 03:42
  • Yep, that's what I ended up doing in the end, copying the bytes in-memory into a buffer, then writing them in one chunk. Thanks. – d11wtq Apr 15 '12 at 06:03
1

This is because of something called memory alignment. The compiler inserts 3 padding bytes after the first char, so it effectively takes 4 bytes of memory. Bigger types like int can only "start" at the beginning of a block of 4 bytes, so the compiler pads with bytes to reach this point.

I had the same problem with the bitmap header, which starts with 2 chars. I used a char bm[2] inside the struct and wondered for 2 days where the #$%^ the 3rd and 4th bytes of the header were going...

If you want to prevent this you can use __attribute__((packed)) (a GCC/Clang extension), but beware: memory alignment IS necessary for your program to run efficiently.
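
A minimal sketch of the packed variant, assuming GCC or Clang (the attribute is a compiler extension, not standard C) and using the made-up type name ESPR_REL_HEADER_PACKED:

```c
#include <stddef.h>  /* offsetof */

/* GCC/Clang extension: ask the compiler not to insert padding
   between the members of this struct. */
typedef struct __attribute__((packed)) {
  unsigned char version;
  unsigned int  root_node_num;
  unsigned int  node_size;
  unsigned int  node_count;
} ESPR_REL_HEADER_PACKED;

/* Compile-time check (C11): the packed struct is exactly the sum of
   its members, which is 13 bytes where unsigned int is 4 bytes. */
_Static_assert(sizeof(ESPR_REL_HEADER_PACKED) == 1 + 3 * sizeof(unsigned int),
               "packed header should contain no padding");

/* root_node_num now starts right after the single version byte. */
_Static_assert(offsetof(ESPR_REL_HEADER_PACKED, root_node_num) == 1,
               "root_node_num should follow version directly");
```

Packing removes the padding but does nothing about endianness or about the size of unsigned int differing between platforms, so a file written this way is still tied to machines that agree on both.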

Siddharth Rout
Eregrith
1

Try hard not to do this! The size discrepancy is caused by the padding and alignment used by compilers/linkers to optimize accesses to vars for speed. The padding and alignment rules vary with language and OS. Furthermore, writing ints and reading them on different hardware can be problematic due to endianness.

Write your metadata byte-by-byte in a structure that cannot be misunderstood. Null-terminated ASCII strings are OK.
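
As one way to read that advice, the header could be stored as a short ASCII string. The sketch below uses an invented text layout ("ESPR" followed by four decimal numbers) and the hypothetical functions espr_header_format and espr_header_parse. Note that a text header has a variable length, which may not suit a file format that relies on fixed offsets.

```c
#include <stdio.h>

/* Hypothetical helper: format the four header values as a
   null-terminated ASCII string such as "ESPR 1 0 4 1".
   Returns the string length, or -1 if out is too small. */
int espr_header_format(char *out, size_t out_size,
                       unsigned version, unsigned root_node_num,
                       unsigned node_size, unsigned node_count) {
  int n = snprintf(out, out_size, "ESPR %u %u %u %u",
                   version, root_node_num, node_size, node_count);
  return (n < 0 || (size_t)n >= out_size) ? -1 : n;
}

/* Hypothetical helper: parse a string produced by espr_header_format.
   Returns 0 if all four values were read, -1 otherwise. */
int espr_header_parse(const char *in,
                      unsigned *version, unsigned *root_node_num,
                      unsigned *node_size, unsigned *node_count) {
  int fields = sscanf(in, "ESPR %u %u %u %u",
                      version, root_node_num, node_size, node_count);
  return fields == 4 ? 0 : -1;
}
```

The formatted string, including its terminating null byte, can then be written with a single fwrite.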

Martin James
1

I use an awesome open source piece of code written by Troy D. Hanson called TPL: http://tpl.sourceforge.net/. With TPL you don't have any external dependency. It's as simple as including tpl.c and tpl.h in your own program and using the TPL API.

Here is the guide: http://tpl.sourceforge.net/userguide.html

dAm2K
  • This looks interesting, but I think for my particular needs it would be overkill. It also inflates the size of the data by adding its own information to the serialized data. My file will have a strict format (a b-tree, after the initial header), so in theory I should be able to just copy data from the file back into memory, knowing exactly what the data types are. – d11wtq Apr 14 '12 at 11:42
  • +1, interesting, but including the `.c` file is the very definition of an external dependency. – Potatoswatter Apr 14 '12 at 11:42
  • @Potatoswatter the license permits you to redistribute the program, so you don't have problems with the internal dependency of tpl.c and tpl.h; you can bundle them into your program. It's true that it inflates the size because of metadata and string data representation, but portability concerns and fast deployment can definitely be issues. – dAm2K Apr 14 '12 at 11:46
0

If you want to write the data in a specific format, use array(s) of unsigned char ...

unsigned char outputdata[13];
outputdata[0] = 1;
outputdata[1] = 0;
/* ... of course, use data from struct ... */
outputdata[12] = 0;
fwrite(outputdata, sizeof outputdata, 1, fp);
pmg