Suppose I am trying to represent the contents of a tar file in a C++ struct. Each block of a tar file can be a header (which in turn has 2 possible versions) or a payload, all in blocks of 512 bytes (padded for the headers). Each possible form of a 512-byte block would be similar to what's represented below (simplified):

             +-------+------+-----+------+---------+-----------+-----------------------+
header_v1 -> | fname | mode | uid | size |  ln_tp  |  ln_file  |+++++++++++++++++++++++|
             +-------+------+-----+------+---------+-----------+-------+--------+------+
header_v2 -> |      old_header_data      |  ln_tp  |  ln_file  | other | fields |++++++|
             +---------------------------+---------+-----------+-------+--------+------+
payload   -> |                               raw_data                                  |
             +-------------------------------------------------------------------------+

As you can see, some fields overlap (such as ln_tp and ln_file), and header_v2 also reuses the header_v1 fields, grouped here as old_header_data. Finally, a raw_data field is used for header padding and for the actual file contents.

I have created the following structures to model this (also simplified to match the diagram above, so the array sizes will not be correct):

#include <array>
#include <variant>
#include <vector>

struct pre_posix_t {
  // Pre-POSIX.1-1988 format
  std::array<char, 100> fname;
  std::array<char, 8> mode;
  std::array<char, 8> uid;
  std::array<char, 12> size;
  fd_type_pre link_type; // fd_type_pre is an enum with the allowed values (char)
  std::array<char, 100> link_name;
};

struct ustar_t {
  std::array<char, 156> pre_posix; // first 156 bytes of Pre-POSIX.1-1988 format (thus excluding link_type and link_name)
  fd_type_ustar link_type; // fd_type_ustar is an enum with the allowed values (char, extends fd_type_pre)
  std::array<char, 100> link_name;
  std::array<char, 8> other;
  std::array<char, 32> fields;
  // ...
};

using header_t = std::variant<pre_posix_t, ustar_t>;

using raw_block_t = std::array<char, 512>;

struct tar_t {
  // ...
  std::variant<header_t, raw_block_t> data;
};

using archive_t = std::vector<tar_t>;

Is this a good representation? What would be the idiomatic way of manipulating this data in C++? I'm worried about v2's old_header_data hiding the v1 field values, about link_type and link_name overlapping between the two versions, and about whether std::variant is the best way to handle these conditions while offering a good API for manipulation and keeping the types right.

For example, if I were to construct a v2 header manually, how could I set fname, mode, and the fields exclusive to v2 while working with a header_t? Perhaps by creating a pre_posix_t, converting it to a std::array<char, 156> with some conversion function, and then storing it in a ustar_t's pre_posix member?
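One alternative to the conversion-function idea, sketched here under assumptions: factor the shared fields into their own struct and embed it in both header types, so the v2 header exposes fname and mode as typed members instead of an opaque byte array. The names common_fields_t, set_field, common_of and make_ustar are hypothetical, link_type is reduced to a plain char because the enums are not shown, and the field sizes follow the simplified structs above rather than the real on-disk layout.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cstddef>
#include <string_view>
#include <variant>

// Hypothetical: the fields both versions share, factored into one struct.
struct common_fields_t {
  std::array<char, 100> fname{};
  std::array<char, 8> mode{};
  std::array<char, 8> uid{};
  std::array<char, 12> size{};
};

struct pre_posix_t {
  common_fields_t common;
  char link_type = '0';  // stand-in for fd_type_pre
  std::array<char, 100> link_name{};
};

struct ustar_t {
  common_fields_t common;  // typed access instead of std::array<char, 156>
  char link_type = '0';    // stand-in for fd_type_ustar
  std::array<char, 100> link_name{};
  std::array<char, 8> other{};
  std::array<char, 32> fields{};
};

using header_t = std::variant<pre_posix_t, ustar_t>;

// Copy text into a fixed-width field and pad the rest with NULs.
// Oversized input trips the assert in debug builds and is truncated otherwise.
template <std::size_t N>
void set_field(std::array<char, N>& dst, std::string_view src) {
  assert(src.size() <= N && "value does not fit in the field");
  dst.fill('\0');
  std::copy_n(src.begin(), std::min(src.size(), N), dst.begin());
}

// The shared fields are reachable without knowing which alternative is active.
common_fields_t& common_of(header_t& h) {
  return std::visit([](auto& v) -> common_fields_t& { return v.common; }, h);
}

// Build a v2 header, setting both shared and v2-only fields directly.
header_t make_ustar(std::string_view name, std::string_view mode) {
  ustar_t u;
  set_field(u.common.fname, name);
  set_field(u.common.mode, mode);
  set_field(u.other, "v2only");
  return u;
}
```

With this layout, `make_ustar("file.txt", "0000644")` yields a header_t holding a ustar_t, and `common_of(h).uid` can be edited the same way for either version. Fields that exist only in v2 still require `std::get<ustar_t>` or a visitor.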

Since std::variant is similar to a union, should I expect v1 and v2 to already be padded to 512 bytes?
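A small sketch of how the size question could be checked at compile time rather than assumed. std::variant stores a discriminator next to its largest alternative, so it cannot be exactly one block wide, whereas a struct made only of char arrays can be verified with static_assert and copied to and from a raw block. The struct below uses the field widths of the POSIX ustar header (500 bytes plus 12 bytes of padding) instead of the simplified sizes above. The names ustar_raw_t, from_block, to_block and check_round_trip are hypothetical.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstring>
#include <type_traits>
#include <variant>

constexpr std::size_t block_size = 512;
using raw_block_t = std::array<char, block_size>;

// On-disk image of a ustar header: char members only, so no alignment padding
// is expected, and the static_assert below fails the build if that is wrong.
struct ustar_raw_t {
  char name[100];
  char mode[8];
  char uid[8];
  char gid[8];
  char size[12];
  char mtime[12];
  char chksum[8];
  char typeflag;
  char linkname[100];
  char magic[6];
  char version[2];
  char uname[32];
  char gname[32];
  char devmajor[8];
  char devminor[8];
  char prefix[155];
  char padding[12];  // explicit padding up to the block size
};

static_assert(sizeof(ustar_raw_t) == block_size, "header must fill one block");
static_assert(std::is_trivially_copyable_v<ustar_raw_t>, "needed for memcpy");

// The variant also has to store which alternative is active, so it is larger
// than a block and cannot be read from or written to disk directly.
static_assert(sizeof(std::variant<ustar_raw_t, raw_block_t>) > block_size,
              "variant carries a discriminator");

// Reinterpret a raw block as a header by copying bytes.
ustar_raw_t from_block(const raw_block_t& block) {
  ustar_raw_t h;
  std::memcpy(&h, block.data(), sizeof h);
  return h;
}

// Turn a header back into the 512 bytes that go on disk.
raw_block_t to_block(const ustar_raw_t& h) {
  raw_block_t block;
  std::memcpy(block.data(), &h, sizeof h);
  return block;
}

// Debug-time check that conversion in both directions preserves every byte.
void check_round_trip(const raw_block_t& block) {
  assert(to_block(from_block(block)) == block);
}
```

The design choice here is to keep the byte-exact struct as an I/O detail and let the variant hold whatever in-memory types are convenient, since the two have different size requirements.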

DavSanchez
  • If you can correctly serialize/deserialize your binary data, then there is no need to worry about struct fields shadowing each other just because you use `std::variant`. Whether it's a good solution is a matter of taste, but it's certainly a valid approach. I can imagine you having some `header_t deserializeHeader(BinaryData d)`, and later just `std::visit([](auto&& v){ /*get some info from header*/}, header);` to get the metadata of your packet. The same can be achieved with polymorphism: a `class Header` base class and derived `NewHeader` and `OldHeader`. – pptaszni Jul 19 '22 at 12:06
  • As for "offering a good api" and "how could I get/set some exclusive fields": 1. This is quite opinion-based and depends on what exactly you want to do. If you just need to "represent the content of tar", it looks OK; just implement some public API for your `tar_t` that internally has different implementations for the different header types. 2. https://en.cppreference.com/w/cpp/utility/variant/visit – pptaszni Jul 19 '22 at 12:20
  • My idea is to create a simple tar library with the usual operations (creating, examining and altering tar files, extracting to the filesystem...), mostly as an excuse to explore modern C++ features, such as `std::variant` or C++17's `filesystem`. Hence my worries about `std::variant` and its members, padding, etc – DavSanchez Jul 19 '22 at 12:31
  • As with most STL containers and data types, you don't get any padding or memory-layout guarantees for `std::variant`. Nesting variants (like you did with `header_t`) makes things complicated, and you will struggle to use the nice features of `std::variant` such as `std::visit`. – Jakob Stark Jul 19 '22 at 12:35
  • It seems you have not decided yet what exactly you even want to do. This makes it very hard to answer your question. Basically you are asking: "I want to use `std::variant`, what can I do with it?". Probably it would be useful, if you started to solve a concrete problem, like parsing binary data into those data types or printing them to the console. – Jakob Stark Jul 19 '22 at 12:44
  • Yeah, regarding concrete tasks, I am actually starting with parsing an existing tar file (see my earlier comment for what I want to do). As you say, I'm equally interested in achieving the task and in exploring which (modern) C++ features I can use for it to deepen my knowledge of the language, hence the consideration to use `std::variant`, though using it is not a hard requirement. – DavSanchez Jul 19 '22 at 12:56
  • Also thanks @JakobStark for pointing to the nested variants, I have flattened it to a `std::variant` which should be easier to work with. – DavSanchez Jul 19 '22 at 13:00
  • A tar file is more like a stream from which you get/put headers or blocks. Putting `header_v1` and `header_v2` into a variant might be smart, but you probably shouldn't mix them with data blobs. Also, don't assume the headers will be 512 bytes: the C++ structures might have padding added that the on-disk version doesn't have. You might want to parse the fields into a more usable structure that's bigger. You also have to handle endianness for each field, so you need a proper serialize/deserialize function that deals with each field separately. – Goswin von Brederlow Jul 19 '22 at 13:57
