
I have a program that creates a compressed file using the LZW algorithm with hash tables. The compressed file currently contains integers, each one the index of a hash table entry. The largest integer in the file is around 46000, which fits easily in 16 bits. When I convert this "compressedfile.txt" to a binary file "binary.bin" (to reduce the file size further) using the following code, I get 32-bit integers in "binary.bin". For example, the number 84 in my compressed file becomes 5400 0000 in the binary file.

std::ifstream in("compressedfile.txt");
std::ofstream out("binary.bin", std::ios::out | std::ios::binary);

int d;
while (in >> d) {
    out.write((char*)&d, 4);
}

My question: can't I discard the trailing '0000' in '5400 0000', which takes up 2 extra bytes in my file? This happens with every integer, since my largest value is 46000 and needs only 2 bytes. Is there a way to make the binary file use 2 bytes per integer instead? I hope my question is clear.

Anuj Kumar
  • Simply write/read a `uint16_t` (you might have to include `stdint.h`) – Marius Oct 28 '13 at 15:02
  • Replace "out.write((char*)&d, 4);" by "out.write((char*)&d + difference_sizeof_int_and_int16, sizeof(int16));" - Warning: this ignores endianness! –  Oct 28 '13 at 15:07
  • Note: be aware of issue with little endian vs big endian... – Jarod42 Oct 28 '13 at 15:09

1 Answer


It's writing exactly what you tell it to: 4 bytes starting at the address of d (an int, which is 32 bits on many platforms). Use a 16-bit type and write 2 bytes instead:

uint16_t d; // unsigned to ensure it's large enough to hold your max value of 46000
while (in >> d) out.write(reinterpret_cast<char*>(&d), sizeof d);

Edit: As pointed out in the comments, for this code and the data it generates to be portable across processor architectures, you should pick an endianness convention for the output. I'd suggest using htons() to convert your uint16_t to network byte order; htons() is widely available, though not (yet) part of the C++ standard.

mattnewport
  • @benjymous erm, no? That's formatted output. It prints textual representations. – R. Martinho Fernandes Oct 28 '13 at 15:06
  • That won't do what he wants - he wants to write binary data not text. – mattnewport Oct 28 '13 at 15:06
  • +1 this answers the OP's question. That said, writing platform-dependent byte-orderings, though not likely a concern for the OP, will be an issue if this thing is ever read from a non-like byte-ordered platform. Ideally, the values should be network ordered before writing and host-ordered on reading. *All* multibyte integrals written/read should be similarly handled. Regardless, for what the OP asked this answer is correct. – WhozCraig Oct 28 '13 at 15:11
  • It sounds like your editor may be interpreting the file as UTF16 (16 bit unicode characters) rather than binary. Try using a hex editor or other binary viewer. – mattnewport Oct 28 '13 at 15:39
  • @mattnewport If you would also be so kind as to answer: now that I have this file, how do I read from it? The usual functions like myfile.get() and myfile>>str don't seem to work. – Anuj Kumar Oct 28 '13 at 15:54
  • The equivalent function to ostream::write is istream::read, you need to use these when dealing with binary data rather than formatted input / output. – mattnewport Oct 28 '13 at 15:56
  • @mattnewport So I use in.read(reinterpret_cast<char*>(&d), sizeof d);? Where d is defined the same way as above? – Anuj Kumar Oct 28 '13 at 16:09
  • Yes, and use ntohs() for endianness conversion if you choose to implement that. – mattnewport Oct 28 '13 at 16:14