Data serialization in C?

Question

I have this structure which I want to write to a file:

typedef struct
{
    char* egg;
    unsigned long sausage;
    long bacon;
    double spam;
} order;

This file must be binary and must be readable by any machine that has a C99 compiler.

I looked at various approaches to this matter such as ASN.1, XDR, XML, ProtocolBuffers and many others, but none of them fit my requirements:

small
simple
written in C

I decided then to make my own data protocol. I could handle the following representations of integer types:

unsigned
signed in one's complement
signed in two's complement
signed in sign and magnitude

in a valid, simple and clean way (impressive, no?). However, the real types are being a pain now.

How should I read float and double from a byte stream? The standard says that bitwise operators (at least &, |, << and >>) are for integer types only, which left me without hope. The only way I could think was:

int sign;
int exponent;
unsigned long mantissa;

order my_order;

sign = read_sign();
exponent = read_exponent();
mantissa = read_mantissa();

my_order.spam = sign * mantissa * pow(10, exponent);

but that doesn't seem very efficient. I also could not find a description of the representation of double and float. How should one proceed here?

Nils Pipenbrinck · Accepted Answer · 2015-11-29T12:52:19.100

3

If you want to be as portable as possible with floats you can use frexp and ldexp:

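#include <limits.h>   /* INT_MAX */
#include <math.h>     /* frexp, ldexp */

void WriteFloat (float number)
{
  int exponent;
  unsigned long mantissa;

  /* frexp returns a fraction with magnitude in [0.5, 1) (or 0) and stores
     the binary exponent; the cast assumes number is non-negative */
  mantissa = (unsigned int) (INT_MAX * frexp(number, &exponent));

  WriteInt (exponent);
  WriteUnsigned (mantissa);
}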

float ReadFloat ()
{
  int exponent = ReadInt();
  unsigned long mantissa = ReadUnsigned();

  float value = (float)mantissa / INT_MAX;

  return ldexp (value, exponent);
}

The idea behind this is that ldexp, frexp and INT_MAX are all standard C. Also, an unsigned long usually has at least as many bits as the mantissa of a float (no guarantee, but it is a valid assumption and I don't know a single architecture that is different here).

Therefore the integer conversion itself works without precision loss. Only the multiplication and division by INT_MAX may lose a bit of precision, but that's a compromise one can live with.
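A self-contained sketch of the same frexp/ldexp idea follows. It is not the answer's code: the struct `packed_float`, the helpers `pack_float` and `unpack_float`, and the fixed scale of 2^31 (instead of INT_MAX) are assumptions made for illustration. It also assumes a binary float with at most 31 mantissa bits and a finite input, and it does not preserve the sign of negative zero.

```c
#include <math.h>   /* frexp, ldexp, fabs */

/* Hypothetical wire form for this sketch: sign, exponent, 31-bit mantissa. */
struct packed_float {
    int sign;                 /* 1 if negative, 0 otherwise */
    int exponent;             /* binary exponent reported by frexp */
    unsigned long mantissa;   /* fraction scaled by 2^31, always < 2^31 */
};

/* 2^31: a fixed scale, so the encoding does not depend on INT_MAX. */
#define PACK_SCALE 2147483648.0

/* Split a finite float into sign, exponent and integer mantissa. */
static struct packed_float pack_float(float number)
{
    struct packed_float p;
    double frac = frexp(fabs(number), &p.exponent); /* 0 or in [0.5, 1) */

    p.sign = (number < 0.0f);
    p.mantissa = (unsigned long)(frac * PACK_SCALE);
    return p;
}

/* Rebuild the float from its three parts. */
static float unpack_float(struct packed_float p)
{
    double frac = (double)p.mantissa / PACK_SCALE;
    double value = ldexp(frac, p.exponent);

    return (float)(p.sign ? -value : value);
}
```

Because the scale is a power of two, multiplying and dividing by it only shifts the exponent, so no mantissa bits are rounded away under the stated assumptions. The three fields would still have to be written with whatever integer routines the file format uses.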

edited Nov 29 '15 at 12:52

answered Apr 09 '11 at 22:49

Nils Pipenbrinck


Wow! Thank you very much! `frexp` seems to be just what I needed! – pudiva Apr 09 '11 at 23:02
After reading your post a couple of times, I began to wonder if there isn't a more efficient way. The IEEE-754 format carries an exponent of base 10. If this is the format on my machine, extracting a base 2 log would trigger an unnecessary conversion, no? – pudiva Apr 10 '11 at 00:14

`INT_MAX` is implementation defined, so it's not safe for decoding to rely on it. – u0b34a0f6ae Oct 26 '11 at 14:42
Maybe it's ok to use `2147483647` which is the least `INT_MAX` acceptable in Standard C. – u0b34a0f6ae Oct 26 '11 at 14:48
@kalzer.se: if you don't agree with me about portability, why don't you propose a portable way to solve the issue? – Nils Pipenbrinck Oct 26 '11 at 18:00
Scaling by `INT_MAX` introduces unnecessary rounding errors. `INT_MAX + 1.0` is better. Rather than mixing `unsigned long` and `unsigned`, use one. Suggest `unsigned long` and then use `(LONG_MAX/2 + 1)*2.0f` for scaling. – chux - Reinstate Monica Aug 28 '21 at 17:08

@u0b34a0f6ae [Maybe it's ok to use 2147483647 which is the least INT_MAX acceptable in Standard C.](https://stackoverflow.com/questions/5608370/data-serialization-in-c/5608466#comment9651330_5608466). `INT_MAX` is at _least_ 32767. It could be 2147483647, 9223372036854775807, or others. Using `INT_MAX` breaks portability between systems here. – chux - Reinstate Monica Aug 28 '21 at 17:23

Jason · Answer 2 · 2011-04-09T22:58:59.820

2

If you are using IEEE-754, why not access the float or double as an unsigned short or unsigned long and save the floating point data as a series of bytes, then re-convert the "specialized" unsigned short or unsigned long back to a float or double on the other side of the transmission? The bit data would be preserved, so you should end up with the same floating point number after transmission.
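One way this could look is sketched below, using `memcpy` rather than a cast to copy the bits. The helper names `put_double_be` and `get_double_be` are hypothetical. The sketch assumes that `double` is an 8-byte IEEE-754 binary64 value, that `uint64_t` exists (it is optional in C99), and that floating point and integer types share the same byte order on the machine.

```c
#include <stdint.h>  /* uint64_t (optional in C99, assumed available) */
#include <string.h>  /* memcpy */

/* Store a double as 8 big-endian bytes. Assumes double is IEEE-754
 * binary64 and shares its byte order with uint64_t on this machine. */
static void put_double_be(unsigned char out[8], double value)
{
    uint64_t bits;
    int i;

    memcpy(&bits, &value, sizeof bits);   /* copy the object representation */
    for (i = 0; i < 8; i++)
        out[i] = (unsigned char)((bits >> (56 - 8 * i)) & 0xFFu);
}

/* Rebuild a double from 8 big-endian bytes under the same assumptions. */
static double get_double_be(const unsigned char in[8])
{
    uint64_t bits = 0;
    double value;
    int i;

    for (i = 0; i < 8; i++)
        bits = (bits << 8) | (in[i] & 0xFFu);
    memcpy(&value, &bits, sizeof value);
    return value;
}
```

The shifts work on the integer value, so the byte order in the file is fixed regardless of the host's endianness. On a machine whose `double` is not binary64 the bytes would be meaningless, which is the portability limit of this approach.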

edited Apr 09 '11 at 22:58

answered Apr 09 '11 at 22:35

Jason


Accessing it as a long seems to be undefined behaviour; however, casting it to an array of unsigned chars should be ok. However, is it guaranteed that `double` and `float` will have the same representation on every machine? – pudiva Apr 09 '11 at 22:40
If they are complying with the IEEE-754 standard, then the only issue should be endianness ... so you should store those values in some type of consistent endian format so that you always interpret your byte-array as either little-endian or big-endian no matter what the platform is. The same would be true for a cast to an `unsigned short` or `unsigned long`. – Jason Apr 09 '11 at 22:45
Endianness already shows us that the representations may be different. This seems inconsistent. – pudiva Apr 09 '11 at 22:58
BTW, after thinking about this, rather than a cast, maybe you should use a `union` data-type that contains both a `float` and a `unsigned short`, or a `double` and a `unsigned long` (or `long long` if that is 64-bits wide on your platform). – Jason Apr 09 '11 at 22:58
I'm not sure how you're going to get away from endian problems for multi-byte datatypes. For instance networking applications have to deal with that all the time and they've standardized on big-endian. – Jason Apr 09 '11 at 23:41
**Integer** types have well defined data representations, **real** types do not. =o/ – pudiva Apr 09 '11 at 23:43
So if I'm understanding you correctly, you're wanting to serialize your data into a format that can be deserialized without having to know how the data was originally serialized? – Jason Apr 10 '11 at 03:31
I need to serialize data without knowing how it is represented in memory (as the standard does not specify this). Deserialization will, of course, need to know how the data is written in the file. – pudiva Apr 11 '11 at 11:17
Oh, integers have a well defined representation, floats were the real problem but that was solved! Thanks! Maybe I will put my lib on sourceforge so the poor souls needing this will not have to worry anymore. – pudiva Apr 11 '11 at 11:19

score 2 · Answer 3 · answered Apr 09 '11 at 22:45


If you are using C99 you can output real numbers in portable hex using %a.
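A minimal sketch of a `%a` round trip follows. The helper names `double_to_hex` and `hex_to_double` are made up for this example; `snprintf` and `strtod` are standard C99. The exact digits printed can differ between C libraries, and the decimal point character follows the current locale, so the sketch only relies on the value surviving the round trip.

```c
#include <stdio.h>   /* snprintf */
#include <stdlib.h>  /* strtod */

/* Format value as a C99 hexadecimal floating-point string in buf.
 * Returns 0 on success, -1 if buf is too small or formatting failed. */
static int double_to_hex(char *buf, size_t size, double value)
{
    int n = snprintf(buf, size, "%a", value);

    return (n < 0 || (size_t)n >= size) ? -1 : 0;
}

/* Parse a hexadecimal floating-point string back into a double. */
static double hex_to_double(const char *text)
{
    return strtod(text, NULL);
}
```

With no precision given, `%a` prints enough hexadecimal digits to represent the value exactly when the floating point radix is a power of two, so reading the text back with `strtod` should recover the same double.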

answered Apr 09 '11 at 22:45

lhf


Thanks for pointing that out! However, I need binary representations and printf is just too heavy for me. – pudiva Apr 09 '11 at 22:48

score 2 · Answer 4 · answered Oct 27 '11 at 11:17

This answer uses Nils Pipenbrinck's method but I have changed a few details that I think help to ensure real C99 portability. This solution lives in an imaginary context where encode_int64 and encode_int32 etc already exist.

#include <stdint.h>
#include <math.h>

#define PORTABLE_INTLEAST64_MAX ((int_least64_t)9223372036854775807) /* 2^63-1 */

/* NOTE: +-inf and nan not handled. quickest solution
 * is to encode 0 for !isfinite(val) */
void encode_double(struct encoder *rec, double val) {
    int exp = 0;
    double norm = frexp(val, &exp);
    int_least64_t scale = norm*PORTABLE_INTLEAST64_MAX;
    encode_int64(rec, scale);
    encode_int32(rec, exp);
}

void decode_double(struct encoder *rec, double *val) {
    int_least64_t scale = 0;
    int_least32_t exp = 0;
    decode_int64(rec, &scale);
    decode_int32(rec, &exp);
    *val = ldexp((double)scale/PORTABLE_INTLEAST64_MAX, exp);
}

This is still not a real solution: inf and nan cannot be encoded. Also notice that both parts of the double carry sign bits.

int_least64_t is guaranteed by the standard (int64_t is not), and we use the least permissible maximum for this type to scale the double. The encoding routines accept int_least64_t but will have to reject input that is larger than 64 bits for portability; the same applies to the 32 bit case.
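One possible way to cover inf and nan is to carry a small kind tag next to the mantissa and exponent, as sketched below. This is an illustration, not part of the answer: `struct dbl_record`, the `KIND_*` constants, `split_double` and `join_double` are hypothetical names, and the scale of 2^53 is this sketch's choice. It assumes a binary `double` with at most 53 mantissa bits and that the C99 `INFINITY` and `NAN` macros are available. NaN payloads and the sign of negative zero are not preserved.

```c
#include <math.h>    /* frexp, ldexp, isnan, isinf, INFINITY, NAN */
#include <stdint.h>  /* int_least64_t, int_least32_t */

enum { KIND_FINITE = 0, KIND_POS_INF = 1, KIND_NEG_INF = 2, KIND_NAN = 3 };

/* Hypothetical record for this sketch; a real encoder would write the
 * three fields with its own integer routines. */
struct dbl_record {
    int kind;                /* one of the KIND_* constants */
    int_least64_t mantissa;  /* signed fraction scaled by 2^53 */
    int_least32_t exponent;  /* binary exponent reported by frexp */
};

static struct dbl_record split_double(double val)
{
    struct dbl_record r = { KIND_FINITE, 0, 0 };
    int exp = 0;

    if (isnan(val)) { r.kind = KIND_NAN; return r; }
    if (isinf(val)) { r.kind = (val > 0) ? KIND_POS_INF : KIND_NEG_INF; return r; }

    /* Scaling by 2^53 only changes the exponent, and |result| <= 2^53. */
    r.mantissa = (int_least64_t)ldexp(frexp(val, &exp), 53);
    r.exponent = exp;
    return r;
}

static double join_double(struct dbl_record r)
{
    switch (r.kind) {
    case KIND_POS_INF: return INFINITY;
    case KIND_NEG_INF: return -INFINITY;
    case KIND_NAN:     return NAN;
    default:           return ldexp((double)r.mantissa, (int)r.exponent - 53);
    }
}
```

Only the mantissa carries the sign here, and the power-of-two scale avoids the rounding that a scale of 2^63-1 can introduce when it is converted to `double`.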

Instead of `2^63-1` (which, once converted to a `double`, loses precision and likely becomes 2^63), use `norm*9223372036854775808.0 /* 2^63 */`. — chux - Reinstate Monica, Aug 28 '21 at 17:11
I like the idea of _not_ using `INT_MAX` here, but least fixed sized integer types. — chux - Reinstate Monica, Aug 28 '21 at 17:18

score 1 · Answer 5 · edited May 23 '17 at 12:32

The C standard doesn't define a representation for floating point types. Your best bet would be to convert them to IEEE-754 format and store them that way. The question "Portability of binary serialization of double/float type in C++" may help you there.

Note that the C standard also doesn't specify a format for integers. While most computers you're likely to encounter will use a normal two's-complement representation with only endianness to be concerned about, it's also possible they would use a one's-complement or sign-magnitude representation, and both signed and unsigned ints may contain padding bits that don't contribute to the value.
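A sketch of how integers can be written without depending on the in-memory layout follows. The helper names (`put_u32_be`, `get_u32_be`, `put_s32_be`, `get_s32_be`) and the choice of a 32-bit big-endian two's-complement wire format are assumptions for illustration. All work is done with arithmetic on values, so padding bits and the native signed representation do not matter.

```c
/* Write the low 32 bits of v as 4 big-endian bytes, using only
 * arithmetic on the value, never the in-memory layout. */
static void put_u32_be(unsigned char out[4], unsigned long v)
{
    out[0] = (unsigned char)((v >> 24) & 0xFFu);
    out[1] = (unsigned char)((v >> 16) & 0xFFu);
    out[2] = (unsigned char)((v >> 8) & 0xFFu);
    out[3] = (unsigned char)(v & 0xFFu);
}

static unsigned long get_u32_be(const unsigned char in[4])
{
    return ((unsigned long)(in[0] & 0xFFu) << 24)
         | ((unsigned long)(in[1] & 0xFFu) << 16)
         | ((unsigned long)(in[2] & 0xFFu) << 8)
         |  (unsigned long)(in[3] & 0xFFu);
}

/* Signed values travel as 32-bit two's complement. Converting a signed
 * value to unsigned is defined by value in C (modulo 2^N), whatever the
 * native signed representation is. */
static void put_s32_be(unsigned char out[4], long v)
{
    put_u32_be(out, (unsigned long)v & 0xFFFFFFFFUL);
}

static long get_s32_be(const unsigned char in[4])
{
    unsigned long u = get_u32_be(in);

    if (u <= 0x7FFFFFFFUL)
        return (long)u;
    /* Negative: rebuild from the magnitude so that no out-of-range
     * unsigned-to-signed conversion takes place. */
    return -(long)(0xFFFFFFFFUL - u) - 1;
}
```

One caveat: the wire value -2147483648 has no counterpart on a one's-complement or sign-magnitude machine with a 32-bit long, so a decoder on such a machine would have to reject it.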

I read this carefully and implemented every possibility the standard allows! Thanks! — pudiva, Apr 09 '11 at 23:03
