Test Under Visual Studio 2012 + Win7
The UTF-8 text file contains merely 5 bytes:
31 0a 32 0a 0a
Viewed as text, it looks like:
1
2
The source is also straightforward:
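#include <stdio.h>
#include <tchar.h>
#include <windows.h>   /* LONG */

int main(void)
{
    FILE *fp;
    TCHAR buf[100] = {0};
    TCHAR *line;
    LONG pos;

    /* Open as UTF-8 text; bail out if the file cannot be opened. */
    if (_tfopen_s(&fp, _T("...\\test.txt"), _T("r,ccs=UTF-8")) != 0)
        return 1;
    line = _fgetts(buf, 100, fp);   /* reads the 1st line */
    pos = ftell(fp);                /* expected 2, observed 1 */
    if (fseek(fp, pos, SEEK_SET) != 0)
        perror("fseek error");
    line = _fgetts(buf, 100, fp);   /* expected the 2nd line */
    pos = ftell(fp);
    fclose(fp);
    return 0;
}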
However, when debugging the program, the 1st ftell() returns a position of 1 instead of 2. So when _fgetts() is called for the 2nd text line, it gets only a line-ending character instead of the character 2.
I wonder whether files are handled incorrectly in "r,ccs=UTF-8" text mode (the sample works well in "r" mode (EDIT: NOT true! The 1st ftell() returns 0. Thanks to Hans for pointing that out)).
(It's even weirder that ftell() works correctly when the UTF-8 text file contains any non-ANSI characters, but let's solve the pure ANSI case first. And yes, I've already searched the forum but, astonishingly, found no similar question.)
The best workaround so far is to read the lines in "r" mode and then convert them from UTF-8 to UTF-16 (wide-character) strings. Any more skillful suggestion would be really appreciated.
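A minimal sketch of that workaround, for illustration only. The helper names utf8_to_wide and read_line_utf8 are hypothetical, and the decoder is hand-written so that it does not depend on the locale or on any Windows API. It handles only 1- to 3-byte UTF-8 sequences (enough for the characters used in this post) and does not reject overlong encodings.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <wchar.h>

/* Hypothetical helper: decode a NUL-terminated UTF-8 string into wide
 * characters. Only 1- to 3-byte sequences are accepted, so every code point
 * fits in one wchar_t even where wchar_t is 16 bits wide.
 * Returns the number of wide characters written (terminator excluded),
 * or (size_t)-1 on malformed or unsupported input, or if dst is too small. */
size_t utf8_to_wide(const char *src, wchar_t *dst, size_t dst_len)
{
    const unsigned char *s = (const unsigned char *)src;
    size_t n = 0;

    assert(src != NULL && dst != NULL && dst_len > 0);
    while (*s != 0) {
        unsigned long cp;
        int extra;   /* number of continuation bytes still expected */

        if (*s < 0x80)                { cp = *s;        extra = 0; }
        else if ((*s & 0xE0) == 0xC0) { cp = *s & 0x1F; extra = 1; }
        else if ((*s & 0xF0) == 0xE0) { cp = *s & 0x0F; extra = 2; }
        else return (size_t)-1;   /* 4-byte form or stray continuation byte */
        s++;
        while (extra-- > 0) {
            if ((*s & 0xC0) != 0x80)
                return (size_t)-1;   /* truncated or malformed sequence */
            cp = (cp << 6) | (*s & 0x3F);
            s++;
        }
        if (n + 1 >= dst_len)
            return (size_t)-1;       /* no room left for the terminator */
        dst[n++] = (wchar_t)cp;
    }
    dst[n] = L'\0';
    return n;
}

/* Hypothetical helper: read one line as raw bytes, then decode it.
 * The stream is assumed to be opened without ccs=, e.g. with "r". */
wchar_t *read_line_utf8(FILE *fp, wchar_t *dst, size_t dst_len)
{
    char raw[400];

    if (fgets(raw, (int)sizeof raw, fp) == NULL)
        return NULL;
    return utf8_to_wide(raw, dst, dst_len) == (size_t)-1 ? NULL : dst;
}
```

Calling read_line_utf8(fp, buf, 100) would then stand in for _fgetts(buf, 100, fp) in the sample, with ftell() and fseek() operating on the plain "r" stream.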
----- UPDATE divider (2015/03/25) -----
Test Under MinGW + Win7 and GCC + CentOS
Having received valuable comments on the following key points,
- compiler implementation: Microsoft vs GNU @n.m.
- inaccurate internal buffer usage for ftell(): @Hans Passant
- fixed-length encoding (e.g. "r" mode) vs variable-length encoding (e.g. "r,ccs=UTF-8" mode)
- 1-char line ending (single LF) vs 2-char line ending (CR+LF) @Hans Passant, @IInspectable
I decided to test the problem under combined conditions.
Text files used
file   line ending  ANSI/mixed  BOM     encoding
1.txt  single LF    pure        n/a     UTF-8
2.txt  CR-LF*       pure        n/a     UTF-8
3.txt  CR-LF*       mixed       n/a     UTF-8
4.txt  CR-LF*       mixed       EFBBBF  UTF-8
5.txt  CR-LF*       mixed       FFFE    UTF-16
* Except for tests under CentOS, which use single-LF only.
Source used (for GNU compiler)
#include <locale.h>
#include <stdio.h>
#include <wchar.h>

int main(void)
{
    FILE *fp;
    wchar_t buf[100] = {0};
    wchar_t *line;
    long pos;

    //setlocale(LC_CTYPE, "en_GB.UTF-8"); //uncomment this for GNU+CentOS
    fp = fopen("....txt", "r"); //or "r,ccs=UTF-8"
    if (fp == NULL) {
        perror("fopen error");
        return 1;
    }
    pos = ftell(fp);
    if (fseek(fp, pos, SEEK_SET) != 0)
        perror("fseek error");
    line = fgetws(buf, 100, fp);
    pos = ftell(fp);
    if (fseek(fp, pos, SEEK_SET) != 0) //breakpoint, check result of ftell()
        perror("fseek error");
    line = fgetws(buf, 100, fp);
    pos = ftell(fp);
    fclose(fp);
    return 0;
}
Result#1: "r" mode, GNU+Win7
1.txt (single LF): pos=0, NG. Really failed! (@Hans Passant, @IInspectable)
2.txt (pure ANSI): pos=7, OK
3.txt (non-ANSI): pos=13, OK (string is UTF-8 encoded)
4.txt (BOM=EFBBBF, UTF-8): pos=9, NG (BOM is also read)
5.txt (BOM=FFFE, UTF-16): pos=9, NG (BOM is also read)
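For the UTF-8 BOM case (4.txt), one option in plain "r" mode is to probe the first three bytes and step over them when they are EF BB BF. This is only a sketch and was not run against the files in this post. skip_utf8_bom is a hypothetical helper name, it assumes the stream has just been opened, and it does nothing for the UTF-16 file.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: if the stream begins with the UTF-8 BOM (EF BB BF),
 * leave the read position just past it; otherwise go back to the start.
 * Must be called right after opening the file.
 * Returns 1 if a BOM was skipped, 0 if not. */
int skip_utf8_bom(FILE *fp)
{
    static const unsigned char bom[3] = { 0xEF, 0xBB, 0xBF };
    unsigned char head[3];
    size_t got;

    assert(fp != NULL);
    got = fread(head, 1, sizeof head, fp);
    if (got == sizeof head && memcmp(head, bom, sizeof bom) == 0)
        return 1;
    rewind(fp);   /* no BOM: undo the probe read (also clears EOF) */
    return 0;
}
```

Positions reported by ftell() afterwards still count the three BOM bytes, so they remain valid arguments for fseek() on the same stream.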
Result#2: "r,ccs=UTF-8" mode, GNU+Win7, with/without setlocale()
1.txt (single LF): pos=-3!, NG (1st line can be read, UTF-16 = "\0x31\0xa")
2.txt (pure ANSI): pos=0, NG (1st line can be read, UTF-16 = L"1abcd\n")
3.txt (non-ANSI): pos=8, NG (1st line can be read as UTF-16, but 2nd line is incorrect!)
4.txt (BOM=EFBBBF, UTF-8): pos=9, OK! (BOM ignored, string is UTF-16 = "\0x31\0x4f60\0xa". 2nd line is "\0x32\0x597d")
5.txt (BOM=FFFE, UTF-16): pos=10, OK! (BOM ignored, string is UTF-16 = "\0x31\0x4f60\0xa". 2nd line is "\0x32\0x597d")
Result#3: "r,ccs=UTF-8" mode, GNU+CentOS, WITH setlocale()
1.txt (single LF): pos=2, OK
2.txt (pure ANSI): pos=6, OK
3.txt (non-ANSI): pos=12, OK
4.txt (BOM=EFBBBF, UTF-8): not tested
5.txt (BOM=FFFE, UTF-16): not tested
Conclusion
- For GNU+CentOS, if (and only if) setlocale() is used, ftell() works perfectly. I guess that's because the single-LF line ending is standard on Unix.
- For Windows, however, if you use either single-LF line endings or "ccs=UTF-8" mode, ftell() will give you an inaccurate return value without warning. setlocale() makes no difference here. However, UTF-8/UTF-16 files with a BOM are handled perfectly, which suggests that ftell() might be able to handle variable-length encodings after all??
Finally, just as mentioned before, "r" mode (with CR+LF line endings) will "save the world".~
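A related option that sidesteps text-mode ftell() entirely is to open the file in binary mode ("rb") and keep the byte offset yourself. This is my own substitution, not something tested above: for a binary stream the C standard defines the file position as a byte count from the start of the file, so an offset accumulated this way should be safe to pass to fseek(). read_line_tracked is a hypothetical helper name, and the sketch assumes lines contain no NUL bytes and fit in the buffer.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: read one line from a stream opened in binary mode
 * and advance *offset to the byte position where the NEXT line starts.
 * The offset grows by the number of bytes fgets() stored, so the result
 * does not depend on ftell() at all. Line endings are kept as stored
 * in the file ("\n" or "\r\n"). */
char *read_line_tracked(FILE *fp, char *buf, int buf_len, long *offset)
{
    assert(fp != NULL && buf != NULL && offset != NULL);
    if (fgets(buf, buf_len, fp) == NULL)
        return NULL;
    *offset += (long)strlen(buf);
    return buf;
}
```

Starting with an offset of 0 right after fopen(), each call leaves the offset at the beginning of the following line, and fseek(fp, offset, SEEK_SET) returns to that point later. The raw bytes would still need to be decoded from UTF-8 afterwards.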
@Hans Passant, @n.m., please kindly amend the conclusion if I missed anything.