-1

I want to load an HTML file into memory (in fact, into a wchar_t string). This is the code:

size_t myGetFileSize(const wchar_t *wcPath)
{
    struct _stat fileinfo;
    _wstat(wcPath, &fileinfo);
    return (fileinfo.st_size);
}
int LoadUtf8FileToString(const wchar_t *wcFilename, wchar_t **wcBuffer)
{
    FILE* file = _wfopen(wcFilename, L"rtS, ccs=UTF-8");
    if (file == NULL)
        return (0);
    size_t filesize = myGetFileSize(wcFilename);
    if (filesize > 0)
    {
        *wcBuffer = (wchar_t*) malloc(filesize * sizeof(wchar_t));
        size_t nRead = fread(*wcBuffer, sizeof(wchar_t), filesize, file);
        realloc(*wcBuffer, nRead * sizeof(wchar_t));
    }
    fclose(file);
    return(1);
}

When I navigate an IWebBrowser2 to it, it shows the whole page plus 4 empty squares at the end of the page! I googled and found a string class called wstring, and used it like this:

std::wstring wString;
/////////////////////
wString->resize(filesize);
size_t wchars_read = fread(&(wString->front()), sizeof(wchar_t), filesize, file);
wString->resize(wchars_read);
wString->shrink_to_fit();

When I navigate the IWebBrowser2 to that, everything is OK! But I don't want to use any classes in my program. So what is wrong with my code, please?

alk
Shaheen
  • `realloc` may not re-use the same starting address. You throw it away. (Unsure if this is "the" problem - what does "it show all page and 4 empty square at the end of page" mean?) – Jongware Sep 26 '15 at 10:47
  • *Why* do you reallocate? The buffer is allocated after the file-size, and unless there's an error reading that's the size that `fread` will read, so `realloc` may be a no-op. – Some programmer dude Sep 26 '15 at 10:50
  • The 4 empty squares are unknown characters, like these: [][][][] – Shaheen Sep 26 '15 at 10:51
  • I used realloc() because filesize is greater than the number of characters read! – Shaheen Sep 26 '15 at 10:52
  • *Then* I'd say it is an invalid HTML page, at least by the standard, well ok which one ... – alk Sep 26 '15 at 10:53
  • I used wchar_t because I want to load unicode pages. – Shaheen Sep 26 '15 at 10:53
  • UTF8 can be stored in `char`-arrays. – alk Sep 26 '15 at 10:56
  • Can UTF8 loaded into char show strings in both RTL and LTR languages? – Shaheen Sep 26 '15 at 11:02
  • @Shaheen How a UTF8 string is stored in terms of variable types, and how a UTF8 string is drawn to the screen, are two completely different things. => Yes, UTF8 in char arrays can be displayed in both directions, if the drawing part can do that. – deviantfan Sep 26 '15 at 11:32

2 Answers

0

You seem to forget two things. One is that UTF-8 is a variable-length encoding: a character may be one byte, or it may be up to four bytes (the original design allowed up to six). You can't read it as a fixed-width encoding. The other is that the file size you get is not the number of characters in the file, it's the number of bytes in the file.

In fact, if you're reading an HTML file, odds are that quite a lot of the text consists of single-byte characters, namely all the markup.

In short, the file doesn't contain filesize characters, it contains filesize bytes. And you try to read sizeof(wchar_t) * filesize bytes, which is why the fread call will return the "wrong" size.
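To make the bytes-versus-characters distinction concrete, here is a minimal sketch that counts the characters (code points) in a UTF-8 byte sequence. The function name `utf8_count_chars` is made up for illustration, and the sketch assumes the input is valid UTF-8.

```c
#include <stddef.h>

/* Hypothetical helper: counts the UTF-8 code points in the first n bytes
   of s. Every byte that is NOT a continuation byte (bit pattern 10xxxxxx)
   starts a new character. Assumes the input is valid UTF-8. */
size_t utf8_count_chars(const char *s, size_t n)
{
    size_t count = 0;
    for (size_t i = 0; i < n; ++i) {
        if (((unsigned char)s[i] & 0xC0) != 0x80)
            ++count;
    }
    return count;
}
```

For pure-ASCII markup the two numbers agree, but as soon as the page contains non-ASCII text the character count is smaller than the byte count, so a buffer sized by the file size is only ever an upper bound on the number of characters.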

Some programmer dude
0
  1. The realloc is unnecessary. You allocate the correct amount of memory up front (not really, see the points below), and the number of characters read can only differ if fread somehow fails and reads fewer characters. Then again, even if it does and you shrink (!) your allocated buffer, you discard the return value of realloc, so your pointer still points to the original memory block, which realloc may have freed. You may be getting away with this undefined behavior because (apparently) either the memory block is resized in place, or realloc determines that resizing is not necessary.

  2. You are getting random characters at the end of your string because it is used as a zero-terminated string, yet you neither allocate space for a terminating zero nor write one.

  3. filesize is in bytes. Thus allocate and read in bytes, not in wchar_t units.
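The three points above can be sketched roughly as follows. This is not the asker's function: it uses plain `fopen` in binary mode instead of `_wfopen` with `ccs=UTF-8` so that it stays portable, the name `load_file_bytes` is hypothetical, and it returns the raw UTF-8 bytes in a `char` buffer. Converting those bytes to `wchar_t` (for example with `MultiByteToWideChar` on Windows) would be a separate step that is not shown here.

```c
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helper: reads the whole file at `path` into a newly
   allocated, zero-terminated byte buffer. Returns NULL on failure; the
   caller frees the result. If out_len is not NULL, it receives the
   number of bytes read (terminator not included). */
char *load_file_bytes(const char *path, size_t *out_len)
{
    FILE *file = fopen(path, "rb");
    if (file == NULL)
        return NULL;

    /* Point 3: the file size is a byte count. */
    if (fseek(file, 0, SEEK_END) != 0) { fclose(file); return NULL; }
    long size = ftell(file);
    if (size < 0) { fclose(file); return NULL; }
    rewind(file);

    /* Point 2: reserve one extra byte for the terminating zero. */
    char *buffer = malloc((size_t)size + 1);
    if (buffer == NULL) { fclose(file); return NULL; }

    size_t n_read = fread(buffer, 1, (size_t)size, file);
    fclose(file);
    buffer[n_read] = '\0';

    /* Point 1: realloc may move the block, so keep its return value. */
    if (n_read < (size_t)size) {
        char *shrunk = realloc(buffer, n_read + 1);
        if (shrunk != NULL)
            buffer = shrunk;
    }

    if (out_len != NULL)
        *out_len = n_read;
    return buffer;
}
```

Returning the buffer instead of filling an out-parameter is only a design choice for the sketch; the important parts are the `+ 1` for the terminator, writing that terminator at index `n_read`, and assigning the result of `realloc` back to the pointer.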

Jongware