
So, I need to read a Unicode file, compress it using Huffman coding, and write the result to a new file.

The reason for Unicode is special characters such as the longer dash (as opposed to the plain hyphen) and others. Without Unicode, reading and writing with ifstream/ofstream and unsigned char turns the dash into 3 individual chars, and when I decompress the file, it adds chars that weren't there.
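To illustrate the "3 individual chars" effect: assuming the input file is UTF-8 encoded (an assumption, since the encoding is not stated), the long dash U+2014 is stored as the three bytes E2 80 94, and a narrow stream hands those bytes over one at a time. The helper names below are hypothetical and exist only for this sketch.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: number of bytes (UTF-8 code units) in the string.
// std::string stores bytes, so size() does not count characters.
std::size_t utf8ByteCount(const std::string &s) {
    return s.size();
}

// Hypothetical helper: number of code points, assuming well-formed UTF-8.
// Continuation bytes have the form 10xxxxxx and are skipped.
std::size_t utf8CodePointCount(const std::string &s) {
    std::size_t count = 0;
    for (unsigned char byte : s) {
        if ((byte & 0xC0) != 0x80) {
            ++count;
        }
    }
    return count;
}
```

For the string "\xE2\x80\x94" (the long dash), the first helper reports three bytes and the second reports one code point, which matches what a byte-oriented read sees.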

Now, I use std::wifstream and std::wofstream to do this, like so:

constexpr size_t bitsNum = 65536; // must be a constant expression to size std::bitset
std::wifstream in("a", std::ios::binary);
std::wofstream out("b", std::ios::binary);

void compress(std::wifstream &in, std::wofstream &out) {

    in.clear();
    in.seekg(0);

    uint64_t size = 0;

    for (wchar_t i = 0; i < nodes.size(); ++i) {
        size += nodes.at(i).probability * codes.at(nodes.at(i).value).length;
    }

    std::cout << "Final size: " << size << '\n';

    wchar_t c, w = 0, length, lengthW = 0;
    std::bitset<bitsNum> bits;

    while (!in.eof() && in.good()) {
        c = in.get();
        bits = codes.at(c).bits;
        length = codes.at(c).length;

        for (wchar_t i = 0; i < length; ++i) {
            if (lengthW == 16) {
                lengthW = 0;
                out << w;
                w = 0;
            }

            w <<= 1;
            w |= bits.test(length - i - 1) & 1;
            ++lengthW;
        }
    }

    if (lengthW == 16) {
        lengthW = 0;
        out << w;
        w = 0;
    }
    else if (lengthW) {
        w <<= 16 - lengthW;
        out << w;
        w = 0;
    }

    out.flush();

    if (DECOMPRESS) decompress();

}

The nodes object consists of the frequency distribution for each character that was read from the file, and the codes object consists of bit codes for each of the characters that have to be transformed.
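For readers following along, here is a minimal sketch of what nodes and codes might look like, inferred only from how they are used in compress(). The type names Node and Code, the choice of std::vector and std::map, and the helper compressedBitCount are assumptions made for illustration; the real definitions may differ.

```cpp
#include <bitset>
#include <cstddef>
#include <cstdint>
#include <map>
#include <vector>

constexpr std::size_t bitsNum = 65536; // constant expression, so it can size std::bitset

// Assumed shape of one frequency entry: a character and how often it occurs.
struct Node {
    wchar_t value;
    std::uint64_t probability;
};

// Assumed shape of one Huffman code: the bit pattern and its length in bits.
struct Code {
    std::bitset<bitsNum> bits;
    std::size_t length = 0;
};

// Mirrors the "Final size" loop: total number of bits in the compressed output.
std::uint64_t compressedBitCount(const std::vector<Node> &nodes,
                                 const std::map<wchar_t, Code> &codes) {
    std::uint64_t total = 0;
    for (const Node &node : nodes) {
        total += node.probability * codes.at(node.value).length;
    }
    return total;
}
```

Note that the value computed this way is a bit count, not a byte count.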

The result is that I can read the file with no problem, but when I write the new bits, nothing gets written to the output file.

I tried imbuing a locale on the streams and also setting a global locale; neither helped.

Besides streaming the wchar_t into the wofstream with <<, I tried the .put() and .write() functions, with no luck.

Any ideas on what may be wrong?

PS: I am only allowed to use standard C++17 with no extensions.

Thanks!

dodancs
  • _"Imagine, that the..."_: you need to show the code. What's the result type of `transform`? – Richard Critten Nov 15 '19 at 12:25
  • @RichardCritten sorry, added exact code used. – dodancs Nov 15 '19 at 12:30
  • The problem is that you pass character data into a function that expects octets as input. You must encode the character data into octets first. UTF-8 is an exceedingly popular encoding scheme. – daxim Nov 15 '19 at 13:23
  • Whether characters are or are not Unicode is completely and totally irrelevant. If "it adds chars that weren't there" that means there's a bug in the compression or decompression logic, and that has nothing to do with Unicode. Huffman encoding doesn't care what it's encoding. It's just a binary glop. So, the premise of this question is faulty, and the logic should still be using narrow streams. Having said that, this "exact code" still fails to meet the requirements for a [mre] (something I would expect someone who's been on stackoverflow.com, for longer than me, to know already). – Sam Varshavchik Nov 15 '19 at 13:23
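
Following the narrow-stream suggestion in the comments, here is a minimal sketch of packing Huffman bits into bytes and writing them through a plain std::ostream. A plausible explanation for the empty output (an assumption, since the stream state is not shown in the question) is that a wide file stream converts every wchar_t through its locale's codecvt facet, and a packed 16-bit word that is not a convertible character makes the conversion fail and puts the stream into a failed state. The BitWriter class and packBits helper are hypothetical names for this sketch, not part of the original code.

```cpp
#include <cstdint>
#include <ostream>
#include <sstream>
#include <string>

// Packs single bits MSB-first into bytes and writes them to a narrow stream.
class BitWriter {
public:
    explicit BitWriter(std::ostream &out) : out_(out) {}

    void putBit(bool bit) {
        current_ = static_cast<std::uint8_t>((current_ << 1) | (bit ? 1 : 0));
        if (++filled_ == 8) {
            flushByte();
        }
    }

    // Pads the last partial byte with zero bits and flushes the stream.
    void finish() {
        if (filled_ > 0) {
            current_ = static_cast<std::uint8_t>(current_ << (8 - filled_));
            flushByte();
        }
        out_.flush();
    }

private:
    void flushByte() {
        out_.put(static_cast<char>(current_));
        current_ = 0;
        filled_ = 0;
    }

    std::ostream &out_;
    std::uint8_t current_ = 0;
    int filled_ = 0;
};

// Hypothetical helper for trying the writer: turns a string of '0'/'1'
// characters into the packed bytes the writer would emit.
std::string packBits(const std::string &bits) {
    std::ostringstream out;
    BitWriter writer(out);
    for (char c : bits) {
        writer.putBit(c == '1');
    }
    writer.finish();
    return out.str();
}
```

With a real file, the same writer would take a std::ofstream opened with std::ios::binary. The decompressor also needs to know how many bits are valid, because the final byte is zero-padded; storing the bit count or the original character count alongside the data is one way to handle that.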
