5

I have a problem that involves a large number of small integers (actually decimal digits). What is a space-efficient way to store such data?

Is it a good idea to use std::bitset<4> to store one decimal digit?

dann
    What does sizeof give for your bitset? –  Jul 08 '17 at 12:05
  • 5
    Are all your numbers single digits? How many numbers do you have? Do you need to access digits by position? Are you willing to trade performance for space efficiency? **Obviously, before doing that kind of optimisation, you have to figure out what you need!** In most cases, using an appropriate integer or char type might be more than enough. – Phil1970 Jul 08 '17 at 12:12
  • 2
    Also, are there patterns you can exploit in the sequence of digits, or parts of that sequence? Are some of the digits likely to occur more frequently than others? If so, you might be able to use some encoding, rather than a direct representation. – Peter Jul 08 '17 at 12:34
  • Remember that your program needs to spend execution time packing and unpacking before any math, comparison or I/O is used. You will be trading performance for space. – Thomas Matthews Jul 08 '17 at 18:05
  • Unless this is a tightly space constrained system, I don't recommend compression. I suggest using external storage rather than compression. Compression will insert more code and with more code there may be more injected defects that need to be resolved. – Thomas Matthews Jul 08 '17 at 18:07

3 Answers

3

Depending on how space-efficient it has to be and how efficient the retrieval should be, I see two possibilities:

  • Since a vector of std::bitset<4> is (as far as I know) stored unpacked (each bitset typically occupies a whole memory word, either 32 or 64 bits), you should probably at least use a packed representation, such as a 64-bit word that stores 16 digits:

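    store (only valid if the nibble at index is still zero):
    block |= (uint64_t)digit << (4 * index);
    load:
    digit = (block >> (4 * index)) & 0xF;
    reset (the mask must be 64 bits wide, or the shift overflows for index >= 8):
    block &= ~((uint64_t)0xF << (4 * index));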
    

A vector of these 64 bit words (uint64_t) together with some access methods should be easy to implement.
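
The packed layout described above could be wrapped roughly as follows. This is only a minimal sketch: the class name `PackedDigits` and its members are hypothetical and not part of the answer, and it assumes digits are addressed by a zero-based position.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper class: stores decimal digits 16 per 64-bit word,
// 4 bits each.
class PackedDigits {
public:
    explicit PackedDigits(std::size_t count)
        : words_((count + 15) / 16, 0), size_(count) {}

    std::size_t size() const { return size_; }

    // Load: shift the wanted nibble down and mask it out.
    unsigned get(std::size_t i) const {
        assert(i < size_);
        return static_cast<unsigned>((words_[i / 16] >> shift(i)) & 0xF);
    }

    // Reset the nibble first, then store, so overwriting is always safe.
    void set(std::size_t i, unsigned digit) {
        assert(i < size_ && digit <= 9);
        std::uint64_t& word = words_[i / 16];
        word &= ~(std::uint64_t{0xF} << shift(i));
        word |= std::uint64_t{digit} << shift(i);
    }

private:
    // Bit offset of digit i inside its 64-bit word.
    static unsigned shift(std::size_t i) {
        return 4 * static_cast<unsigned>(i % 16);
    }

    std::vector<std::uint64_t> words_;
    std::size_t size_;
};
```

With this sketch, `PackedDigits d(1000000);` holds one million digits in about 500 kB, and `d.set(i, 7)` / `d.get(i)` need only a shift and a mask each.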

  • If your space requirements are even tighter, you could e.g. try packing 3 digits into 10 bits (the 1000 combinations fit into the 1024 available values) using division and modulo, which would be a lot less time-efficient. Alignment with 64-bit words is also much more difficult, so I would only recommend this if you need the final ~16% improvement. It gets you about 3.33 bits per digit, which is close to the best you can achieve.
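
The 3-digits-in-10-bits idea might look like the sketch below. All function names are hypothetical, and the sketch assumes the simple word-aligned variant in which six 10-bit groups share one 64-bit word and 4 bits stay unused.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// Hypothetical helper: encode three decimal digits as a value 0..999,
// which fits in 10 bits. d0 is the least significant digit.
inline std::uint16_t pack3(unsigned d0, unsigned d1, unsigned d2) {
    assert(d0 <= 9 && d1 <= 9 && d2 <= 9);
    return static_cast<std::uint16_t>(d0 + 10 * d1 + 100 * d2);
}

// Hypothetical helper: decode a 10-bit group back into its three digits.
inline std::array<unsigned, 3> unpack3(std::uint16_t group) {
    assert(group < 1000);
    return {group % 10u, (group / 10u) % 10u, group / 100u};
}

// Store a group in one of the six 10-bit slots of a 64-bit word.
inline std::uint64_t store_group(std::uint64_t word, unsigned slot,
                                 std::uint16_t group) {
    assert(slot < 6 && group < 1000);
    const unsigned shift = 10 * slot;
    word &= ~(std::uint64_t{0x3FF} << shift);  // clear the slot
    return word | (std::uint64_t{group} << shift);
}

// Load the group held in the given slot.
inline std::uint16_t load_group(std::uint64_t word, unsigned slot) {
    assert(slot < 6);
    return static_cast<std::uint16_t>((word >> (10 * slot)) & 0x3FF);
}
```

Note that this aligned variant only reaches 64/18 ≈ 3.56 bits per digit. Getting down to 10/3 ≈ 3.33 requires letting groups straddle word boundaries, which is the alignment difficulty mentioned above.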
Tobias Ribizel
3

If you want a very compact way, then no, using bitset<4> is a bad idea, because bitset<4> will use at least one byte, instead of 4 bits.

I'd recommend using std::vector<std::uint32_t>

You can store multiple digits in a uint32_t. Two usual ways:

  1. Use 4 bits for each digit, and use bit operations. This way you can store 8 digits in 4 bytes. Here, set/get operations are pretty fast. Efficiency: 4 bits/digit.
  2. Use base-10 encoding. The maximum uint32_t value is 256^4-1 (4,294,967,295), which is enough to store 9 digits in 4 bytes. Efficiency: 3.56 bits/digit. Here, if you need to set/get all 9 digits, it is almost as fast as the previous version (division by 10 will be optimized by a good compiler, so no actual division will be done by the CPU). If you need random access, then set/get will be slower than the previous version (you can speed it up with libdivide).

If you use uint64_t instead of uint32_t, then you can store 16 digits with the first way (same 4 bits/digit efficiency), and 19 digits with the second way: 64/19 ≈ 3.37 bits/digit, which is pretty close to the theoretical minimum of log2(10) ≈ 3.3219 bits/digit.
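
As a rough illustration of the base-10 variant with uint64_t, the sketch below packs 19 digits into one word. The names `kDigitsPerWord`, `pack19`, `unpack19` and `digit_at` are hypothetical, and the sketch assumes the digit at index 0 is the least significant one.

```cpp
#include <cassert>
#include <cstdint>

// 19 digits fit because 10^19 - 1 is smaller than 2^64 - 1.
constexpr unsigned kDigitsPerWord = 19;

// Pack 19 digits into one word; digits[0] is the least significant.
inline std::uint64_t pack19(const unsigned char (&digits)[kDigitsPerWord]) {
    std::uint64_t word = 0;
    for (unsigned i = kDigitsPerWord; i-- > 0;) {
        assert(digits[i] <= 9);
        word = word * 10 + digits[i];
    }
    return word;
}

// Unpack all 19 digits sequentially; the divisor is the constant 10,
// which compilers usually turn into a multiplication.
inline void unpack19(std::uint64_t word,
                     unsigned char (&digits)[kDigitsPerWord]) {
    for (unsigned i = 0; i < kDigitsPerWord; ++i) {
        digits[i] = static_cast<unsigned char>(word % 10);
        word /= 10;
    }
}

// Random access: the divisor depends on the index, so this is slower
// than the shift-and-mask access of the 4-bit layout.
inline unsigned digit_at(std::uint64_t word, unsigned index) {
    assert(index < kDigitsPerWord);
    std::uint64_t divisor = 1;
    for (unsigned i = 0; i < index; ++i) divisor *= 10;
    return static_cast<unsigned>((word / divisor) % 10);
}
```

A lookup table of the 19 powers of ten would avoid the loop in `digit_at`, at the cost of 152 bytes.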

geza
1

Is it a good idea to use std::bitset<4> to store one decimal digit?

Yes, in principle storing one decimal digit in 4 bits is a good idea. It's a well-known optimization called BCD (binary-coded decimal) encoding.

(actually decimal digits). What is a space-efficient way to store such data?

You can compact the decimal digit representation by using only one nibble of a byte per digit. Arithmetic may also be implemented more efficiently on this form than on an ASCII representation of the digits or similar.

std::bitset<4> itself won't serve that well for compacting the data, though: a std::bitset<4> object still occupies at least a full byte.
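
To check the actual cost on a given toolchain, something like the following could be used. It is a small sketch with hypothetical helper names; the exact size of `std::bitset<4>` is implementation-defined and is often a whole machine word rather than a single byte.

```cpp
#include <bitset>
#include <climits>
#include <cstddef>

// Hypothetical helper: bits of storage consumed per digit when every digit
// lives in its own std::bitset<4> object.
constexpr std::size_t bits_per_digit_with_bitset() {
    return sizeof(std::bitset<4>) * CHAR_BIT;
}

// Hypothetical helper: bits per digit when two digits share one byte.
constexpr double bits_per_digit_with_nibbles() {
    return CHAR_BIT / 2.0;
}

// Every complete C++ object occupies at least one byte, so a bitset per
// digit can never get below CHAR_BIT bits per digit.
static_assert(bits_per_digit_with_bitset() >= CHAR_BIT,
              "an object occupies at least one byte");
```

Printing `sizeof(std::bitset<4>)` answers the question raised in the comments for your specific compiler and standard library.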

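An alternative data structure I could think of is a bitfield:

#include <cstdint>

// Maybe #pragma pack(push, 1)
struct TwoBCDDecimalDigits {
    std::uint8_t digit1 : 4;  // first digit (0-9), stored in one nibble
    std::uint8_t digit2 : 4;  // second digit (0-9), stored in the other nibble
};
// Maybe #pragma pack(pop)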
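
Accessing the n-th digit in a vector of such bitfield structs could be sketched as follows. The accessor names `get_digit` and `set_digit` are hypothetical, and the sketch assumes even positions map to `digit1` and odd positions to `digit2`.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

struct TwoBCDDecimalDigits {
    std::uint8_t digit1 : 4;
    std::uint8_t digit2 : 4;
};

// Hypothetical accessor: element n/2 holds the digit, n%2 picks the member.
inline unsigned get_digit(const std::vector<TwoBCDDecimalDigits>& v,
                          std::size_t n) {
    assert(n / 2 < v.size());
    const TwoBCDDecimalDigits& pair = v[n / 2];
    return (n % 2 == 0) ? pair.digit1 : pair.digit2;
}

// Hypothetical mutator: writes one member and leaves the other untouched.
inline void set_digit(std::vector<TwoBCDDecimalDigits>& v, std::size_t n,
                      unsigned digit) {
    assert(n / 2 < v.size() && digit <= 9);
    if (n % 2 == 0) {
        v[n / 2].digit1 = digit & 0xF;
    } else {
        v[n / 2].digit2 = digit & 0xF;
    }
}
```

The member selection on `n % 2` is the extra step a bitfield forces on indexed access; a plain array of bytes allows the same lookup with a shift and a mask instead.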

There is even a library available, to convert this format to a normalized numerical format supported at your target CPU architecture:


Another way I could think of is to write your own class:

#include <cstdint>
#include <vector>

class BCDEncodedNumber {
    enum class Sign_t : char {
        plus = '+',
        minus = '-'
    };
    // Two BCD digits per byte, one in each nibble.
    std::vector<std::uint8_t> doubleDigitsArray;

    // Implements math operation + against the current BCD representation
    // stored in doubleDigitsArray (declared only; definition not shown).
    void AddDigits(int num);
public:
    BCDEncodedNumber() = default;
    BCDEncodedNumber(int num) {
        AddDigits(num);
    }
};
user0042
  • All objects in C++ will necessarily take 1 byte at minimum, so if you use a `bitset<4>` you aren't gaining anything over just using a `char`. – Matteo Italia Jul 08 '17 at 12:17
  • @MatteoItalia OK, that was a misunderstanding. I didn't mean to use one full byte per digits 0-9 but use the nibbles of a single byte. Let me rectify my answer. – user0042 Jul 08 '17 at 12:27
  • Definitely better, although IMHO using a bitfield here is just a hindrance - say that you have a vector of `TwoBCDDecimalDigits`, now to access the n-th digit you have to get the element n/2 and then have an if over n%2 to decide which member to read. If it was just an array of `uint8_t`, as in your second solution, you could just shift and mask, without having branches (n-th digit = `(array[n>>1]>>((n&1)<<2))&0xF`). – Matteo Italia Jul 08 '17 at 13:04
  • @Matteo Well, I wasn't in the mood right now to elaborate on the implementation details of `BCDEncodedNumber`. Also there's an appropriate, portable library available. But hopefully that Q&A may push everyone into the right direction what needs to be researched. – user0042 Jul 08 '17 at 13:20