What I have:

  • a hex file containing the bytes of a C struct, ordered big-endian
  • the struct definition as *.h file
  • the struct information as dwarf2 debug info
  • the constraint that my application has to be written in C / C++ (intermediate scripts, for example in Python, would be OK)

What I have to do is read the bytes of the hex file and cast them into the struct type on a little-endian system. During this process, I have to reverse the bytes of each struct member.

The obvious solution would be to write a conversion function that byte-swaps each struct member. But the struct has multiple layers and ~1200 members that change faster than I can update the conversion function, so writing it by hand is not a solution.

So I could generate the conversion function automatically by:

  1. Finding and parsing the types inside multiple *.h files
  2. Iterating over the members of all struct types and generating swaps for them (not that easy without some sort of reflection API)
  3. Loading the struct via the conversion function

Since this solution seems like quite a bit of work, I was wondering if there is an easier way, like telling the compiler to swap the bytes or using the debug info somehow.

Does anybody know a trick that might help in this case?

Thanks and greetings!

Remark: Changing any of the processes leading to this, changing the input conditions, or delegating responsibilities to other developers involved is not possible.

  • Changing anything about the hex file as an input is not possible. This file comes out of another system that will not change to fix this problem.
  • Padding, data type sizes, etc. are identical. This is ensured by other measures, too. So endianness is definitely the only problem. This is also why I see no reason against using the DWARF 2 info to identify the bytes of every struct member.
    • I agree that the layout of the struct is very bad. There are reasons why it is that way, and in short, I cannot (and am not allowed to) change it anyway, for process reasons and for backwards compatibility.

To give some more scope:

The software that all of this is used in is deployed to multiple types of embedded devices. The hex file contains the calibration information of the software and is therefore stored in a specific system that can only output this hex file. I am now porting the software to a little-endian device, and I have to use the hex file from the "main" branch of the software, which is big-endian, as an input.

Philipp G.
  • Use a proper serialization library/system. Define a grammar for the file format. – EOF May 22 '20 at 17:44
  • This sounds like a very slippery path to go. If the file was created on a different architecture with a different compiler, you are making too many assumptions here. Are you sure the structure layout, padding, data type sizes and such are identical other than endianness? – Eugene Sh. May 22 '20 at 17:45
  • _"the struct has multiple layers and ~1200 members that are changing faster than i can update my conversion function"_ That sounds like a pretty serious problem that's worth fixing in its own right. Where's the abstraction?! – Asteroids With Wings May 22 '20 at 17:46
  • @EugeneSh. I didn't originally care, but you're absolutely right about the "slippery path". Especially the `dwarf2` part is more than a little hair-raising. `dwarf2`-debug info for *which architecture/ ABI*? – EOF May 22 '20 at 17:48
  • added information above. – Philipp G. May 22 '20 at 18:03
  • Your question is inconsistent. You claim that nothing can be done about the source format, but also that it changes quickly. – EOF May 22 '20 at 18:09
  • Python is better for this, IMHO. Leverage some combination of `pyelftools` and `struct` to generate a binary parser for your struct datatype from DWARF, then sic it on the hex file. DWARF is intimidating at first, but not deadly, especially with a GUI browser such as `dwex`. – Seva Alekseyev May 22 '20 at 18:23
  • If the source format keeps changing, then somebody has to change the struct, no? The one who adds the new data can also update the conversion function. This would only take time the first time. On the other hand, as EOF mentioned, a serialization library such as [kaitai](https://kaitai.io/) may be helpful. – seleciii44 May 22 '20 at 18:26
  • @EOF: The fact that it is (just) a struct and that it has to keep a certain architecture can not be changed. But it is possible to add members to the struct. – Philipp G. May 22 '20 at 19:07
  • @seleciii44: You are right, that person would be the best one to do this. But as stated i can not change any processes in this organization or delegate responsibilities. – Philipp G. May 22 '20 at 19:10
  • That's a remarkably arbitrary restriction. However, if you use an appropriate serialization system, it's quite likely that it will allow autogeneration of appropriate language bindings for serialization/deserialization, which in the case of `c` will include struct definitions for the format for various targets/ABIs. Since the struct definition *can* be changed, as you say, you simply have to invert the path from struct->deserialization format to serdes-format->struct definitions for various architectures. – EOF May 22 '20 at 19:11
  • Maybe you need to gently introduce the concept of sanity to your colleagues, for their own good. Do they really think they can just binary dump a huge struct to a file, and *somebody else* will have to write a piece of software that converts that dump to another architecture while tracking changes they make? In my shop any person thinking this way would be more than welcome to pack their stuff and leave immediately. – n. m. could be an AI May 22 '20 at 19:37
  • OBTW, when you talk of the structure having multiple layers, do you mean - nested structures/arrays, or *pointers* to data elsewhere? I hope not the latter... – Seva Alekseyev May 22 '20 at 20:42
  • *"2. Iterating members of all struct-types and generate swaps for them"* - [`magic_get`](https://github.com/apolukhin/magic_get) could probably do it. – HolyBlackCat May 22 '20 at 21:30
  • If an answer has helped you solve your problem, please consider accepting it. – horstr Dec 31 '20 at 16:50

3 Answers

There is no standard way to tell a C or C++ compiler to swap bytes from LE to BE or vice versa automatically. You really have to do it yourself. If your data structs are really huge, probably the best way is to generate the conversion code automatically.
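
A minimal sketch of such a generator, written in Python since intermediate scripts are allowed. The struct name `calib`, the member paths, and the emitted function name are hypothetical; in practice the member list would come from parsed headers or debug info. The emitted C reverses each scalar member in place via its address, so it also covers floating-point members, but it cannot handle bit-fields.

```python
# Hypothetical flattened list of scalar members, written as C access paths.
MEMBERS = ["header.version", "header.crc", "gain", "offsets[0]", "offsets[1]"]

# C helper emitted once at the top of the generated file.
HELPER = """\
#include <stddef.h>

/* Reverse n bytes in place; works for any scalar type. */
static void swap_bytes(void *p, size_t n)
{
    unsigned char *b = (unsigned char *)p;
    for (size_t i = 0; i < n / 2; ++i) {
        unsigned char t = b[i];
        b[i] = b[n - 1 - i];
        b[n - 1 - i] = t;
    }
}
"""


def generate_converter(struct_name, members):
    """Return C source for a function that byte-swaps every listed member."""
    lines = [HELPER, f"void {struct_name}_swap(struct {struct_name} *s)", "{"]
    for path in members:
        # One swap call per scalar member; sizeof keeps the size in sync.
        lines.append(f"    swap_bytes(&s->{path}, sizeof s->{path});")
    lines.append("}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(generate_converter("calib", MEMBERS))
```

The generated file still needs to include the header that defines the struct, and the generator has to be rerun whenever the struct changes.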

ivan.ukr

Recent versions of GCC (6 and later) let you declare the desired endianness independently of the target platform, either for a section of source code with `#pragma scalar_storage_order big-endian` or for a specific struct or union type with `__attribute__((scalar_storage_order("big-endian")))`. The main catch: g++ does not support this. It also does not work in all cases; for example, taking a pointer to a member with transparent endianness conversion leads to an error. Unless you are okay with sticking to C for struct access (it all depends on your current codebase), this is not an option.

The persistence layout is based on the original struct layout - so be it. However, a more explicit approach to serializing the structs should be preferred, for exactly the reason you bring this up. Besides the endianness issue, struct packing also affects compatibility and should be specified explicitly. For persistence, a packing of 1 would be optimal. For in-memory data structures, that alignment is far from optimal in terms of performance and concurrency characteristics. Also, different platforms might have incompatible data types (e.g. sizeof(long) differs between 64-bit Linux, which is LP64, and 64-bit Windows, which is LLP64). So keeping the persistence layout separate from the in-memory data structures has a long list of advantages that usually outweigh the disadvantage of having to maintain the serialization code separately, particularly if portability is a major concern.

You could take advantage of C/C++-based reflection libraries or implement one yourself. In the case of C, this will definitely require macros (e.g. Metaresc). In the case of C++, you might actually get away with your original struct definitions (e.g. Boost.Precise and Flat Reflection).

If reflection is not an option, you could generate the serialization code either by parsing the headers or debug symbols. Generally, parsing C/C++ is more complex. By moving the structs involved into dedicated headers, you might get away with a simple C/C++ parser. To make things easier, you could simplify parsing by processing the gdb output of ptype based on debug symbols. Or, you could parse debug symbols directly. With a scripting language like Python, both approaches should be feasible (pygccxml and pyelftools come to mind).

Rather than generating the serialization code as part of the build process, you could generate that code once and require manual updates whenever the structs change in the future. That's what I would do in a multi-platform scenario. Doing that would also spare you the pain of implementing a perfect parser that can deal with all kinds of C/C++ input; it would only have to be good enough for one-time generation.

horstr
  • I did not know about `scalar_storage_order`, so thank you for pointing it out. However, I would nevertheless recommend against using it, except in autogenerated struct definitions created by a proper serialization system. – EOF May 22 '20 at 21:22
  • You're right. I don't recommend doing so either. I'd separate the two concerns ( serialization and in-memory representation). – horstr May 22 '20 at 21:25

The problem, as far as I understand it, is tricky but tractable. I assume the data extraction won't be running on an embedded device, so it won't be resource constrained. I say - embrace the runtime inefficiency that desktop hardware allows, and go for easy to debug instead.

Instead of thinking of the source file as "almost what I need modulo a couple of minor adjustments", think of it as "generic binary file with an open ended, evolving schema". The schema description is the DWARF data.

What I would do: start a Python project. Use the pyelftools PyPI module to parse the DWARF. Iterate over the compile units (CUs). In each CU, iterate over the top-level entries (DIEs). Look for a DW_TAG_structure_type DIE with a specific value of DW_AT_name (I hope the struct name is known in advance). Then go through the DW_TAG_member sub-DIEs. DW_AT_data_member_location will give you the offset, letting you work around the padding. Look at DW_AT_type to detect the member type (you'd have to resolve the DIE reference for that). Recurse into struct- and array-type members as necessary.
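
A rough sketch of that walk, assuming pyelftools is installed and that the struct is a named, top-level type. The function names are made up for illustration. It only lists the direct members; typedef and qualifier DIEs, nested structs, arrays, and bit-fields are left out. In DWARF 2 the member offset is usually stored as a small location expression (`DW_OP_plus_uconst` followed by a ULEB128 number), while later versions may store a plain constant, so the sketch accepts both.

```python
DW_OP_PLUS_UCONST = 0x23  # opcode value from the DWARF specification


def uleb128(data):
    """Decode one unsigned LEB128 number from a sequence of byte values."""
    result = shift = 0
    for byte in data:
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            break
        shift += 7
    return result


def member_offset(attr_value):
    """Turn a DW_AT_data_member_location value into a byte offset."""
    if isinstance(attr_value, int):          # constant form
        return attr_value
    if attr_value and attr_value[0] == DW_OP_PLUS_UCONST:
        return uleb128(attr_value[1:])       # simple location expression
    raise ValueError(f"unsupported location expression: {attr_value!r}")


def list_members(struct_die):
    """Return (name, offset, tag of the type DIE) for each direct member."""
    members = []
    for child in struct_die.iter_children():
        if child.tag != "DW_TAG_member":
            continue
        name = child.attributes["DW_AT_name"].value.decode()
        location = child.attributes["DW_AT_data_member_location"].value
        type_die = child.get_DIE_from_attribute("DW_AT_type")
        members.append((name, member_offset(location), type_die.tag))
    return members


def find_struct(elf_path, struct_name):
    """Locate a named struct in the DWARF info and list its members."""
    from elftools.elf.elffile import ELFFile  # third-party: pyelftools

    with open(elf_path, "rb") as f:
        dwarf = ELFFile(f).get_dwarf_info()
        for cu in dwarf.iter_CUs():
            for die in cu.get_top_DIE().iter_children():
                if die.tag != "DW_TAG_structure_type":
                    continue
                name = die.attributes.get("DW_AT_name")
                if name is None or name.value != struct_name.encode():
                    continue
                if "DW_AT_declaration" in die.attributes:
                    continue  # forward declaration without members
                return list_members(die)
    return None
```

A real tool would follow `DW_AT_type` through typedefs down to the base type, read `DW_AT_byte_size` there, and recurse into nested structs and arrays.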

From that, generate a format string for the struct.unpack function - with a ">" prefix it reads big-endian values seamlessly. Then use struct.pack to write the values in whatever format the C++ consumer expects.
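
To illustrate, here is a small sketch that builds the format string from a member table and converts a big-endian image into a little-endian one. The schema, its field names, and the struct size are invented for the example; a real table would be produced from the DWARF data. Gaps between members are filled with `x` pad bytes, so the offsets from the debug info are honored exactly.

```python
import struct

# Hypothetical schema: (member name, byte offset, struct code, element count).
SCHEMA = [
    ("version", 0, "H", 1),
    ("flags", 2, "B", 1),
    ("gain", 4, "f", 1),
    ("table", 8, "h", 3),
]
STRUCT_SIZE = 16  # total size including trailing padding (assumed)


def build_format(schema, total_size):
    """Build a format body with explicit pad bytes between members."""
    parts, pos = [], 0
    for name, offset, code, count in schema:
        if offset < pos:
            raise ValueError(f"member {name!r} overlaps the previous one")
        parts.append("x" * (offset - pos))       # padding before the member
        parts.append(f"{count}{code}")
        pos = offset + struct.calcsize("=" + code) * count
    parts.append("x" * (total_size - pos))       # trailing padding
    return "".join(parts)


def big_to_little(raw, schema, total_size):
    """Re-encode a big-endian struct image as little-endian."""
    body = build_format(schema, total_size)
    values = struct.unpack(">" + body, raw)      # ">" means big-endian
    return struct.pack("<" + body, *values)      # "<" means little-endian


if __name__ == "__main__":
    image = struct.pack(">HBxf3hxx", 0x0102, 7, 1.5, -1, 2, 3)
    print(big_to_little(image, SCHEMA, STRUCT_SIZE).hex())
```

Pad bytes are written back as zeros, and the result can be loaded into the C struct directly as long as padding and type sizes match on both sides, as stated in the question.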

This depends on you being able to track the data file to the DWARF info of the generating executable, exactly the same build. I hope the processes of the organization allow for that.

Seva Alekseyev