
I'm looking into parsing terminfo database files, which are binary files. You can read about their storage format on your own and confirm the problem I'm facing.

The manual says:

The header section begins the file. This section contains six short integers in the format described below. These integers are

(1) the magic number (octal 0432);

...

...

Short integers are stored in two 8-bit bytes. The first byte contains the least significant 8 bits of the value, and the second byte contains the most significant 8 bits. (Thus, the value represented is 256*second+first.) The value -1 is represented by the two bytes 0377, 0377; other negative values are illegal. This value generally means that the corresponding capability is missing from this terminal. Machines where this does not correspond to the hardware must read the integers as two bytes and compute the little-endian value.


  • The first problem while parsing this type of input is that the format fixes the size to 8 bits, so plain old char cannot be used, since it doesn't guarantee the size to be exactly 8 bits. So I was looking at 'Fixed width integer types', but again I faced the dilemma of choosing between int8_t and uint8_t, which clearly state: "provided only if the implementation directly supports the type". So what should I choose so that the type is portable enough?

  • The second problem is that there is no buffer.readInt16LE() method in the C++ standard library that would read 16 bits (2 bytes) of data in little-endian format. So how should I proceed to implement this function in a portable and safe way?

I've already tried reading it with the char data type, but it definitely produces garbage on my machine. Proper input can be read with the infocmp command, e.g. $ infocmp xterm.


#include <fstream>
#include <iostream>
#include <vector>

int main()
{
    std::ifstream db(
      "/usr/share/terminfo/g/gnome", std::ios::binary | std::ios::ate);

    std::vector<unsigned char> buffer;

    if (db) {
        auto size = db.tellg();
        buffer.resize(size);
        db.seekg(0, std::ios::beg);
        db.read(reinterpret_cast<char*>(buffer.data()), size);
    }
    std::cout << "\n";
}

$1 = std::vector of length 3069, capacity 3069 = {26 '\032', 1 '\001', 21 '\025',
  0 '\000', 38 '&', 0 '\000', 16 '\020', 0 '\000', 157 '\235', 1 '\001',
  193 '\301', 4 '\004', 103 'g', 110 'n', 111 'o', 109 'm', 101 'e', 124 '|',
  71 'G', 78 'N', 79 'O', 77 'M', 69 'E', 32 ' ', 84 'T', 101 'e', 114 'r',
  109 'm', 105 'i', 110 'n', 97 'a', 108 'l', 0 '\000', 0 '\000', 1 '\001',
  0 '\000', 0 '\000', 1 '\001', 0 '\000', 0 '\000', 0 '\000', 0 '\000',
  0 '\000', 0 '\000', 0 '\000', 0 '\000', 1 '\001', 1 '\001', 0 '\000',
....
....
Abhinav Gauniyal
    post some code my friend. try reading some bytes into a buffer and take a look at it with your debugger. – john elemans Dec 24 '16 at 20:00
    `char` is guaranteed to be the smallest addressable unit on any system (that's why `sizeof(char)` is specified to always be `1`). So for a system with 8-bit bytes `char` is guaranteed to be 8 bits. And since it's practically *all* systems made in the last 30 years or so there's really no need to worry. If you need to port your program to some old 1970's (or older) system *then* you might need to worry about it, but not otherwise. – Some programmer dude Dec 24 '16 at 20:00
  • You should use `uint8_t` or `unsigned char`. The plain `char` type can behave as `unsigned` or `signed` *depending on the compiler settings*. – Thomas Matthews Dec 24 '16 at 20:21
  • I'd always recommend using the fixed-width integer types. They should be available on any compiler that has even rudimentary support for C++11. – tambre Dec 24 '16 at 20:26
    1) add an assert; `CHAR_BIT==8`. 2) Don't fall into the trap of the [byte order fallacy](https://commandcenter.blogspot.com/2012/04/byte-order-fallacy.html). – wally Dec 24 '16 at 20:26
    As one answer says, if your code doesn't work using `char` then your code is broken. – Carey Gregory Dec 25 '16 at 00:12
  • @johnelemans added code and sample output. – Abhinav Gauniyal Dec 25 '16 at 10:35

1 Answer


The first problem while parsing this type of input is that it fixes the size to 8 bits, so plain old char cannot be used, since it doesn't guarantee the size to be exactly 8 bits.

Any integer that is at least 8 bits is OK. While char isn't guaranteed to be exactly 8 bits, it is required to be at least 8 bits, so as far as size is concerned, there is no problem other than you may in some cases need to mask the high bits if they exist. However, char might not be unsigned, and you don't want the octets to be interpreted as signed values, so use unsigned char instead.

The second problem is that there is no buffer.readInt16LE() method in the C++ standard library that would read 16 bits of data in little-endian format. So how should I proceed to implement this function in a portable and safe way?

Read one octet at a time into an unsigned char. Assign the first octet to a variable that is large enough to represent at least 16 bits. Shift the second octet left by 8 bits and combine it into the variable with a bitwise OR.

Or better yet, don't re-implement it, but use an existing third-party library.

I've already tried reading it with char data type but it definitely produces garbage on my machine.

Then your attempt was buggy. There is no problem inherent with char that would cause garbage output. I recommend using a debugger to solve this problem.

eerorika
  • I've added the code and sample output; would you mind telling me what I am doing wrong with it? – Abhinav Gauniyal Dec 25 '16 at 10:36
  • hmm so I tried defining two `uint8_t` vars `x` & `y`, and read data using - `db.read(reinterpret_cast<char*>(&x), sizeof(x));` and then did what you suggested in another `uint16_t` - `result = x | (y << 8);` and the result was `282` which is indeed `0432` in octal. Not so sure why the debugger output was like that. So does this method work on both little- and big-endian machines? – Abhinav Gauniyal Dec 25 '16 at 11:12
  • Yes, this converts little endian to native endian, regardless of what the native endianness is. – eerorika Dec 25 '16 at 12:14
  • "Or better yet, don't re-implement it, but use an existing library." What library? Is there any standard one? – BarbaraKwarc Jan 13 '17 at 12:04
    @BarbaraKwarc I meant to refer to third-party libraries specifically. There are no functions for this in the C++ standard library, but there are in the POSIX standard C library. Boost has a very nice set of tools for this as well. – eerorika Jan 13 '17 at 12:08