What is the most suitable type of vector to keep the bytes of a file?

I'm considering the int type, because the bits "00000000" (1 byte) are interpreted as 0!

The goal is to save this data (bytes) to a file and retrieve it from that file later.

NOTE: The files contain null bytes ("00000000" in bits)!

I'm a bit lost here. Help me! =D Thanks!


UPDATE I:

To read the file I'm using this function:

char* readFileBytes(const char *name){
    std::ifstream fl(name);
    fl.seekg( 0, std::ios::end );
    size_t len = fl.tellg();
    char *ret = new char[len];
    fl.seekg(0, std::ios::beg);
    fl.read(ret, len);
    fl.close();
    return ret;
}

NOTE I: I need to find a way to ensure that bits "00000000" can be recovered from the file!

NOTE II: Any suggestions for a safe way to save those bits "00000000" to a file?

NOTE III: When using a char array I had problems converting the bits "00000000" to that type.

Code Snippet:

int bit8Array[] = {0, 0, 0, 0, 0, 0, 0, 0};
char charByte = (bit8Array[7]     ) | 
                (bit8Array[6] << 1) | 
                (bit8Array[5] << 2) | 
                (bit8Array[4] << 3) | 
                (bit8Array[3] << 4) | 
                (bit8Array[2] << 5) | 
                (bit8Array[1] << 6) | 
                (bit8Array[0] << 7);
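
For reference, the bit packing in the snippet above can be written as a small loop, together with its inverse. This is only a sketch: the names packBits and unpackBits are made up for illustration, and it assumes 8-bit bytes with the most significant bit first, as in the snippet. The all-zero pattern is simply the byte value 0 and needs no special handling.

```cpp
#include <array>

// Pack eight bits (most significant bit first, as in the snippet above)
// into one byte. Each element of `bits` is expected to be 0 or 1.
// packBits is a hypothetical helper name, not part of any library.
unsigned char packBits(const std::array<int, 8>& bits)
{
    unsigned char byte = 0;
    for (int bit : bits) {
        // Shift what has been accumulated so far and append the next bit.
        byte = static_cast<unsigned char>((byte << 1) | (bit & 1));
    }
    return byte;
}

// Inverse operation: expand a byte into eight bits, most significant first.
std::array<int, 8> unpackBits(unsigned char byte)
{
    std::array<int, 8> bits{};
    for (int i = 0; i < 8; ++i) {
        bits[i] = (byte >> (7 - i)) & 1;
    }
    return bits;
}
```

Using unsigned char for the result avoids any dependence on whether plain char is signed on the target platform.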

UPDATE II:

Following @chqrlie's recommendations:

#include <cstdio>    // printf
#include <fstream>   // std::ifstream
#include <iterator>  // std::istream_iterator
#include <vector>    // std::vector

std::vector<unsigned char> readFileBytes(const char* filename)
{
    // Open the file.
    std::ifstream file(filename, std::ios::binary);

    // Don't skip whitespace: istream_iterator reads with operator>>, which would otherwise drop spaces and newlines.
    file.unsetf(std::ios::skipws);

    // Get its size
    std::streampos fileSize;

    file.seekg(0, std::ios::end);
    fileSize = file.tellg();
    file.seekg(0, std::ios::beg);

    // Reserve capacity.
    std::vector<unsigned char> unsignedCharVec;
    unsignedCharVec.reserve(fileSize);

    // Read the data.
    unsignedCharVec.insert(unsignedCharVec.begin(),
               std::istream_iterator<unsigned char>(file),
               std::istream_iterator<unsigned char>());

    return unsignedCharVec;
}

int main(){

    std::vector<unsigned char> unsignedCharVec;

    // txt file contents "xz"
    unsignedCharVec=readFileBytes("xz.txt");

    // Letters -> UTF8/HEX -> bits!
    // x -> 78 -> 0111 1000
    // z -> 7a -> 0111 1010

    for(unsigned char c : unsignedCharVec){
        printf("%c\n", c);
        for(int o=7; o >= 0; o--){
            printf("%i", ((c >> o) & 1));
        }
        printf("%s", "\n");
    }

    // Prints...
    // x
    // 01111000
    // z
    // 01111010

    return 0;
}

UPDATE III:

This is the code I am using to write to a binary file:

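void writeFileBytes(const char* filename, const std::vector<unsigned char>& fileBytes){
    // Binary mode prevents newline translation, so every byte is written unchanged.
    std::ofstream file(filename, std::ios::out|std::ios::binary);
    // The vector is only read, so it is taken by const reference.
    file.write(reinterpret_cast<const char*>(fileBytes.data()),
               std::streamsize(fileBytes.size()));
}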

writeFileBytes("xz.bin", fileBytesOutput);
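
One way to check that null bytes survive is a write-then-read round trip. The sketch below is an illustration under assumptions: saveBytes and loadBytes are hypothetical helper names, and the caller supplies the file name. Both streams are opened in binary mode, and the read uses std::istreambuf_iterator, which copies raw characters and never skips whitespace.

```cpp
#include <fstream>
#include <iterator>
#include <vector>

// Hypothetical helper: write every byte of `bytes` to `filename`.
// Returns false if the stream reported an error.
bool saveBytes(const char* filename, const std::vector<unsigned char>& bytes)
{
    std::ofstream file(filename, std::ios::binary | std::ios::trunc);
    file.write(reinterpret_cast<const char*>(bytes.data()),
               static_cast<std::streamsize>(bytes.size()));
    return file.good();
}

// Hypothetical helper: read the whole file back into a vector.
// An unreadable file yields an empty vector.
std::vector<unsigned char> loadBytes(const char* filename)
{
    std::ifstream file(filename, std::ios::binary);
    // istreambuf_iterator reads unformatted characters, so zero bytes,
    // spaces and newlines are all kept.
    return std::vector<unsigned char>(std::istreambuf_iterator<char>(file),
                                      std::istreambuf_iterator<char>());
}
```

Comparing the loaded vector with the original using operator== is enough to confirm that nothing was dropped, since the vector's size is part of the comparison.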

UPDATE IV:

Further reading about UPDATE III:

c++ - Save the contents of a "std::vector<unsigned char>" to a file


CONCLUSION:

The problem with the "00000000" bits (1 byte) was solved by changing the type that stores the file's bytes to std::vector<unsigned char>, as suggested in the answers. unsigned char is a universal type (it exists in all environments) and will accept any octet, and the vector keeps track of its own length (unlike the char* in "UPDATE I")!

In addition, switching from a char array to a vector of unsigned char was crucial for success! With a vector I can manipulate my data more safely and completely independently of its content (with a char array I had problems with this).

Thanks a lot!

Eduardo Lucio
  • What are you doing with those bytes? – NathanOliver Oct 14 '16 at 18:57
  • `unsigned char` will hold generic bytes. – AndyG Oct 14 '16 at 18:57
  • I'd use `uint8_t` – krzaq Oct 14 '16 at 18:59
  • @NathanOliver Save to a file and read this file later. Thanks! – Eduardo Lucio Oct 14 '16 at 19:01
  • I agree with krzaq - use `uint8_t`. – Chimera Oct 14 '16 at 19:04
  • @krzaq: Why would one prefer uint8_t over unsigned char for reading bytes? Legitimately curious. – AndyG Oct 14 '16 at 19:06
  • Just remember that when you use `uint8_t` you normally have a `unsigned char` as it is just a typedef. It can cause some fun output issues. – NathanOliver Oct 14 '16 at 19:06
  • @AndyG you make a good point. I guess I would want the compiler to warn me if I'm dealing with a non-standard sized byte, but that can be achieved with `static_assert`. And `unsigned char` is more likely to be the size of the underlying byte, so... Maybe my gut feeling isn't correct here, I'll have to think this over. – krzaq Oct 14 '16 at 19:08
  • @krzaq how you expect compiler to warn you about non standard sized byte? – Slava Oct 14 '16 at 19:10
  • @Slava I assume I'd get warning about a narrowing conversion from `BYTE*` to `smaller_than_byte*`. Then again, I guess you can't have a type smaller than the underlying byte... As I said, usually I use sized types first, question them later, so I may not be correct here. – krzaq Oct 14 '16 at 19:12
  • You cannot have a data type smaller than `char` (aka a byte) in C++, period. So what narrowing are you talking about? – Slava Oct 14 '16 at 19:14
  • None, I guess. But in that case I'd either get a compile error (`uint8_t` unavailable) or it'd be big enough to keep a byte of information. – krzaq Oct 14 '16 at 19:16
  • Why don't you mmap the file instead of reading it? – kfsone Oct 14 '16 at 19:27
  • The file reading and writing functions are going to be fine with null characters, as will the standard containers - including even `std::string`! I'm not sure what your problem is. Maybe you're not opening the file in binary mode? – Mark Ransom Oct 14 '16 at 20:20

3 Answers

Use std::vector<unsigned char>. Don't use std::uint8_t: it won't exist on systems that don't have a native hardware type of exactly 8 bits. unsigned char will always exist; it will usually be the smallest addressable type that the hardware supports, and it's required to be at least 8 bits wide, so if you're trafficking in 8-bit bytes, it will handle the bits that you need.

If you really, really, really like the fixed-width types, you might consider std::uint_least8_t, which will always exist and has at least eight bits, or std::uint_fast8_t, which also has at least eight bits. But file I/O traffics in char types, and mixing char and its variants with vaguely specified "least" and "fast" types may well get confusing.
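
If the surrounding code does rely on bytes being exactly 8 bits, that assumption can be stated at compile time while still storing unsigned char, as one of the comments on the question suggests. The following is a sketch only: ByteBuffer and makeZeroedBuffer are names invented for this example.

```cpp
#include <climits>
#include <cstddef>
#include <cstdint>
#include <limits>
#include <vector>

// Refuse to compile on a platform whose byte is not an octet,
// instead of depending on std::uint8_t being available.
static_assert(CHAR_BIT == 8, "this code assumes 8-bit bytes");

// Guarantees that hold on every conforming implementation:
static_assert(sizeof(unsigned char) == 1, "unsigned char is one byte");
static_assert(std::numeric_limits<unsigned char>::max() >= 255,
              "unsigned char holds at least 0..255");
static_assert(std::numeric_limits<std::uint_least8_t>::max() >= 255,
              "uint_least8_t holds at least 0..255");

// Hypothetical alias for a buffer of raw bytes.
using ByteBuffer = std::vector<unsigned char>;

// Hypothetical helper: a buffer of `n` bytes, all set to zero.
ByteBuffer makeZeroedBuffer(std::size_t n)
{
    return ByteBuffer(n);  // elements are value-initialized to 0
}
```

On an unusual platform the first static_assert fails with a readable message, and the rest of the code keeps using the char-based types that file I/O already expects.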

Pete Becker
  • Seems to me that "unsigned char" is the solution to my "00000000" bits (byte). I'll do the tests. I'll give a return! Thanks! =D – Eduardo Lucio Oct 14 '16 at 19:32

There are 3 problems in your code:

  • You use the char type and return a char *. Yet the return value is not a proper C string as you do not allocate an extra byte for the '\0' terminator nor null terminate it.

  • If the file may contain null bytes, you should probably use type unsigned char or uint8_t to make it explicit that the array does not contain text.

  • You do not return the array size to the caller. The caller has no way to tell how long the array is. You should probably use a std::vector<uint8_t> or std::vector<unsigned char> instead of an array allocated with new.
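
The three points above can be addressed together by returning a vector, so the length travels with the data and no terminator is needed. This is a minimal sketch rather than a definitive implementation: readAllBytes is a hypothetical name, and the function assumes the file fits in memory and reports any failure as an empty vector.

```cpp
#include <cstddef>
#include <fstream>
#include <vector>

// Hypothetical replacement for the char*-returning function in the question.
std::vector<unsigned char> readAllBytes(const char* filename)
{
    // Binary mode: no newline translation. ios::ate: start at the end,
    // so tellg() reports the file size.
    std::ifstream file(filename, std::ios::binary | std::ios::ate);
    if (!file) {
        return {};
    }

    const std::streamoff size = file.tellg();
    if (size <= 0) {
        return {};
    }

    std::vector<unsigned char> bytes(static_cast<std::size_t>(size));
    file.seekg(0, std::ios::beg);
    file.read(reinterpret_cast<char*>(bytes.data()), size);

    // Keep only what was actually read, in case the read came up short.
    bytes.resize(static_cast<std::size_t>(file.gcount()));
    return bytes;
}
```

The caller gets the byte count from size(), so zero bytes inside the data are ordinary elements and nothing has to be freed by hand.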

chqrlie
  • I followed your recommendations. Seems to me that "unsigned char" is the solution to my "00000000" bits (byte). I'll do the tests. I'll give you a return! Thanks! =D – Eduardo Lucio Oct 14 '16 at 20:09
  • @EduardoLucio point 3 is the important one here. You need a way to tell how long the data is, otherwise the convention is to mark the end with a value of 0 bits. I'm assuming that's the source of your problems. Otherwise it shouldn't matter whether you're using `char`, `unsigned char`, or `uint8_t`, except for documenting what you're doing - they'll all behave the same. Keeping bytes in a `char` array is such a common thing that nobody will be confused by it. – Mark Ransom Oct 14 '16 at 23:08

uint8_t is the winner in my eyes:

  • it's exactly 8 bits, or 1 byte, long;
  • it's unsigned without requiring you to type unsigned every time;
  • it's exactly the same on all platforms;
  • it's a generic type that does not imply any specific use, unlike char / unsigned char, which is associated with characters of text even if it can technically be used for any purpose just the same as uint8_t.

Bottom line: uint8_t is functionally equivalent to unsigned char, but does a better job of saying this is some data of unspecified nature in the source code.

So use std::vector<uint8_t>.
#include <stdint.h> to make the uint8_t definition available.

P. S. As pointed out in the comments, the C++ standard defines char as 1 byte, and byte is not, strictly speaking, required to be the same as octet (8 bits). On such a hypothetical system, char will still exist and will be 1 byte long, but uint8_t is defined as 8 bits (octet) and thus may not exist (due to implementation difficulties / overhead). So char is more portable, theoretically speaking, but uint8_t is more strict and has wider guarantees of expected behavior.
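
A comment on the question points out that uint8_t is normally just a typedef of unsigned char, which can lead to surprising stream output. The sketch below illustrates this; the names kUint8IsUnsignedChar, byteAsStreamed and byteAsNumber are made up for the example, and the alias is the common case rather than a guarantee.

```cpp
#include <cstdint>
#include <sstream>
#include <string>
#include <type_traits>

// True where uint8_t is an alias of unsigned char (common, not guaranteed).
constexpr bool kUint8IsUnsignedChar =
    std::is_same<std::uint8_t, unsigned char>::value;

// Streams the value as-is. Where the alias holds, operator<< treats it
// as a character, so 0x78 comes out as the letter x, not as a number.
std::string byteAsStreamed(std::uint8_t value)
{
    std::ostringstream out;
    out << value;
    return out.str();
}

// Converts to a wider integer type first, so the numeric value is printed.
std::string byteAsNumber(std::uint8_t value)
{
    std::ostringstream out;
    out << static_cast<unsigned int>(value);
    return out.str();
}
```

The same cast is needed for unsigned char, so this is not an argument for either type; it only shows that the two usually behave identically.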

Violet Giraffe