I have a std::wstring whose size is 139,580,199 characters.

For debugging, I wrote it to a file with this code:

std::wofstream f(L"C:\\some file.txt");
f << buffer;
f.close();

After that, I noticed that the end of the string is missing. The size of the created file is 109,592,584 bytes (and the "size on disk" is 109,596,672 bytes).

I also checked whether buffer contains null characters, like this:

size_t pos = buffer.find(L'\0');

I expected the result to be std::wstring::npos, but it is 18446744073709551615. My string doesn't have a null character at the end, though, so that is probably OK.

Can somebody explain why the whole string was not written to the file?

ST3
  • Regarding the find, are you saying your buffer doesn't end in a \0 so find oversteps the end? – doctorlove Aug 14 '13 at 09:16
  • If buffer is a `wstring`, it doesn't have to have a `L'\0'` in it. I expect we can't use `find` to locate things in basic char array [not when it's 1.4MB at least]. – Mats Petersson Aug 14 '13 at 09:22
  • `Expecting result to be std::wstring::npos but it is 18446744073709551615` Why do you assume that this is not `std::wstring::npos`? – Lightness Races in Orbit Aug 14 '13 at 10:02
  • @Mats: 133MB actually, but why do you think "we can't use `find` to locate things in basic char array"? – Lightness Races in Orbit Aug 14 '13 at 10:03
  • @LightnessRacesinOrbit A basic char array (`char[]`) doesn't have a member function `find`. (But there's no basic char array here. Although he doesn't actually show the definition of `buffer`, the surrounding text makes it pretty clear that it is a `std::wstring`.) – James Kanze Aug 14 '13 at 10:48
  • What happens if you open the file stream in binary mode? – Joris Timmermans Aug 14 '13 at 11:25
  • @JamesKanze: Right, so I presumed that Mats didn't mean to say that the entire string were of `char` array type, but was referring to an underlying implementation or some other factor. – Lightness Races in Orbit Aug 14 '13 at 11:53

1 Answer

A lot depends on the locale, but typically, files on disk will not use the same encoding form (or even the same encoding) as that used by wchar_t; the filebuf which does the actual reading and writing translates the encodings according to its imbued locale. And there is only a vague relationship between the length of a string in different encodings or encoding form. (And the size the system sees doesn't correspond directly to the number of bytes you can read from the file.)
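To illustrate how loosely the lengths are related, here is a minimal sketch that counts the bytes a string would need in UTF-8. UTF-8 is only an assumption for the example (the external encoding actually used depends on the imbued locale), and utf8_length is a hypothetical helper name. It takes a std::u32string so that each element is exactly one code point.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: number of bytes needed to encode `text` in UTF-8.
// Each char32_t element is treated as one Unicode code point.
std::size_t utf8_length(const std::u32string& text)
{
    std::size_t bytes = 0;
    for (char32_t c : text) {
        if (c < 0x80) {
            bytes += 1;        // ASCII range
        } else if (c < 0x800) {
            bytes += 2;        // e.g. most accented Latin letters
        } else if (c < 0x10000) {
            bytes += 3;        // rest of the basic multilingual plane
        } else {
            bytes += 4;        // supplementary planes
        }
    }
    return bytes;
}
```

With this encoding the byte count is never smaller than the number of code points, which is why a file shorter than size() points to something other than the encoding alone.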

To see if everything was written, check the status of f after the close, i.e.:

f.close();
if ( !f ) {
    //  Something went wrong...
}

One thing that can go wrong is that the external encoding doesn't have a representation for one of the characters. If you're in the "C" locale, this could occur for any character outside of the basic execution character set.
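As a rough sketch, the write and the status check can be wrapped in one function. The name write_all and its parameters are made up for the example, and it takes a narrow path rather than the wide one used in the question, since the wide-path constructor is not available everywhere.

```cpp
#include <cassert>
#include <fstream>
#include <string>

// Hypothetical helper: writes `text` to `path` and reports whether the
// stream was still in a good state after the file was closed.
bool write_all(const char* path, const std::wstring& text)
{
    std::wofstream f(path);
    if (!f) {
        return false;      // the file could not be opened
    }
    f << text;             // a failed conversion or write sets an error bit
    f.close();             // flushes; a failing flush sets failbit
    return !f.fail();      // fail() is true if failbit or badbit is set
}
```

A false result only says that something went wrong; it does not say whether the cause was an unrepresentable character, a full disk, or something else.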

If there is no error above, there's no reason offhand to assume that not all of the string has been written. What happens if you try to read it in another program? Do you get the same number of characters or not?

For the rest, nul characters are characters like any others in a std::wstring; there's nothing special about them, including when they are output to a stream. And 18446744073709551615 looks very much like the value I would expect for std::wstring::npos on a 64 bit machine.
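Both points can be checked with a few lines. This is only a sketch, and the helper names contains_nul and npos_is_max_size_t are invented for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <limits>
#include <string>

// True if `text` holds a nul character anywhere, including in the middle.
bool contains_nul(const std::wstring& text)
{
    return text.find(L'\0') != std::wstring::npos;
}

// npos is defined as size_type(-1), i.e. the largest value of the type.
// With a 64-bit size_t that value is 18446744073709551615.
bool npos_is_max_size_t()
{
    return std::wstring::npos == std::numeric_limits<std::size_t>::max();
}
```

So a result of 18446744073709551615 from find means "not found", and the comparison against std::wstring::npos is the portable way to test for it.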

EDIT:

Following up on Mats Petersson's comment: it's actually highly unlikely that the file ends up with fewer bytes than there are code points in the std::wstring. (std::wstring::size() returns the number of wchar_t elements, which equals the number of code points unless the string contains surrogate pairs, as it can on Windows.) I was thinking in terms of bytes, not in terms of what std::wstring::size() returns. So the most likely explanation is that you have some characters in your string which aren't representable in the target encoding (which probably only supports characters with code points 32-126, plus a few control characters, by default).
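One way to test this hypothesis is to look for the first character that the default encoding is unlikely to accept. The sketch below assumes that everything above 0x7F is a candidate; the exact set of representable characters depends on the implementation and the locale, and first_non_ascii is a hypothetical name.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: index of the first element above 0x7F,
// or std::wstring::npos if the string is pure ASCII.
std::size_t first_non_ascii(const std::wstring& text)
{
    for (std::size_t i = 0; i < text.size(); ++i) {
        // The cast keeps the comparison well defined where wchar_t is signed.
        if (static_cast<unsigned long>(text[i]) > 0x7F) {
            return i;
        }
    }
    return std::wstring::npos;
}
```

If the returned index is close to the size of the truncated file, that supports the explanation. The two numbers need not match exactly, because of buffering and, in text mode, newline translation.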

James Kanze
  • Surely the file-size can't be less number of bytes than number of characters in the string tho', so encoding shouldn't matter - or have I missed something about how `size` works on `wstring` (I did check that it's "length in characters", not length in bytes - because I was thinking something along the lines of it being UTF-8 vs. 16-bit unicode encoding). – Mats Petersson Aug 14 '13 at 11:02
  • @MatsPetersson I think you're right about the file size. I was thinking in terms of bytes in both cases: the file size can certainly be less than the number of bytes in the `std::wstring`. Theoretically, it can also be less than the number of code points in the `std::wstring` (if, for example, the string contains combining characters which can be represented by a single character in the external encoding), but it's hard to imagine a case where that would really occur. – James Kanze Aug 14 '13 at 11:13
  • @JamesKanze On windows the wide string is only 16 bit so there can be code points longer than a single "character", and if it's a locale issue it's likely that the combining characters are not representable in the default locale. A text on windows in a language with many combining characters which also has many characters outside the basic multilingual plane could have this result. Not very common, I admit, but possible. – Joris Timmermans Aug 14 '13 at 11:25
  • @MadKeithV And even with UTF-32, there are cases where a two code point sequence could result in a single character in the target set: Unicode `\u0065\u0301` maps to 0xE9 in ISO 8859-1, for example. (But do any actual implementations do this correctly?) – James Kanze Aug 14 '13 at 12:43