
Test under Visual Studio 2012 + Windows 7

The UTF-8 text file contains merely 5 bytes:

31 0a 32 0a 0a
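For anyone who wants to reproduce this, here is a minimal sketch that writes exactly those five bytes. The function name `write_test_file` and the file name passed to it are hypothetical; the file is opened in binary mode on the assumption that the runtime must not turn LF into CR+LF while writing.

```c
#include <stdio.h>

/* Writes the five bytes 31 0a 32 0a 0a to `path` (hypothetical helper,
   file name chosen by the caller). Returns 0 on success, -1 on failure. */
int write_test_file(const char *path)
{
    static const unsigned char bytes[] = { 0x31, 0x0a, 0x32, 0x0a, 0x0a };
    FILE *fp = fopen(path, "wb");   /* "wb": no newline translation */
    if (fp == NULL)
        return -1;
    size_t written = fwrite(bytes, 1, sizeof bytes, fp);
    if (fclose(fp) != 0)
        return -1;
    return written == sizeof bytes ? 0 : -1;
}
```

Calling `write_test_file("test.txt")` once before running the sample should give the same input as described here.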

In a text editor it is displayed as:

1
2

The source is also straightforward:

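#include <stdio.h>
#include <tchar.h>
#include <windows.h>   /* LONG */

int main(void)
{
    FILE *fp = NULL;
    TCHAR buf[100] = {0};
    TCHAR *line;
    LONG pos;

    /* Open as UTF-8 text; with _UNICODE defined this maps to _wfopen_s(). */
    if (_tfopen_s(&fp, _T("...\\test.txt"), _T("r,ccs=UTF-8")) != 0 || fp == NULL) {
        fputs("cannot open test file\n", stderr);
        return 1;
    }
    line = _fgetts(buf, 100, fp);   /* reads the 1st line */
    pos = ftell(fp);                /* expected 2, observed 1 */

    if (fseek(fp, pos, SEEK_SET) != 0)
        perror("fseek error");
    line = _fgetts(buf, 100, fp);   /* should read the 2nd line */
    pos = ftell(fp);

    fclose(fp);
    return 0;
}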

However, when debugging the program, the 1st ftell() returns a position of 1 instead of 2. So when _fgetts() is called for the 2nd line, it reads only the line feed (the 0a at offset 1) instead of the character 2.

I wonder whether there is a defect in how files are handled in "r,ccs=UTF-8" text mode. I originally wrote that the sample works well in "r" mode (EDIT: NOT true! The 1st ftell() returns 0 there. Thanks to Hans for pointing this out).
(Even weirder, ftell() works correctly when the UTF-8 text file contains any non-ANSI characters, but let's solve the pure-ANSI case first. And yes, I have already searched the forum but, surprisingly, found no similar question.)

The best workaround so far is to read the lines in "r" mode and then convert them from UTF-8 into Unicode (UTF-16) strings. Any more skillful suggestion would be much appreciated.
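The conversion step of that workaround could look roughly like the sketch below. It is a hand-rolled decoder, not what the question's code does: `utf8_decode` is a hypothetical helper that turns UTF-8 bytes into code points with only minimal validation (overlong forms and surrogates are not rejected). On Windows the usual route would be `MultiByteToWideChar` with `CP_UTF8`; the portable version is shown only to make the idea concrete.

```c
#include <stddef.h>
#include <stdint.h>

/* Decodes the UTF-8 bytes src[0..len) into code points in dst (capacity cap).
   Returns the number of code points written, or (size_t)-1 on malformed
   input or insufficient capacity. Hypothetical helper, minimal validation. */
size_t utf8_decode(const unsigned char *src, size_t len, uint32_t *dst, size_t cap)
{
    size_t i = 0, n = 0;
    while (i < len) {
        unsigned char c = src[i];
        uint32_t cp;
        size_t extra;                 /* number of continuation bytes */
        if (c < 0x80)                { cp = c;        extra = 0; }
        else if ((c & 0xE0) == 0xC0) { cp = c & 0x1F; extra = 1; }
        else if ((c & 0xF0) == 0xE0) { cp = c & 0x0F; extra = 2; }
        else if ((c & 0xF8) == 0xF0) { cp = c & 0x07; extra = 3; }
        else return (size_t)-1;       /* stray continuation or invalid lead byte */
        if (extra > len - i - 1 || n >= cap)
            return (size_t)-1;        /* truncated sequence or dst full */
        for (size_t k = 1; k <= extra; k++) {
            unsigned char cc = src[i + k];
            if ((cc & 0xC0) != 0x80)
                return (size_t)-1;    /* not a continuation byte */
            cp = (cp << 6) | (cc & 0x3F);
        }
        dst[n++] = cp;
        i += extra + 1;
    }
    return n;
}
```

Note that the output is code points, not UTF-16: where wchar_t is 16 bits wide (as on Windows), code points above U+FFFF would still need to be split into surrogate pairs.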


----- UPDATE divider (2015/03/25) -----

Test Under MinGW + Win7 and GCC + CentOS

After receiving valuable comments on the following key points,

  • compiler implementation: Microsoft vs GNU (@n.m.)
  • inaccurate internal buffer usage for ftell() (@Hans Passant):

    • fixed-length encoding (e.g. "r" mode) vs variable-length encoding (e.g. "r,ccs=UTF-8" mode)
    • 1-char line ending (single LF) vs 2-char line ending (CR+LF) (@Hans Passant, @IInspectable)

I decided to test the problem under a combination of conditions.

Text files used

file     line ending  ANSI/mixed    BOM       encoding
1.txt    single LF    pure          n/a       UTF-8
2.txt    CR-LF*       pure          n/a       UTF-8
3.txt    CR-LF*       mixed         n/a       UTF-8
4.txt    CR-LF*       mixed         EFBBBF    UTF-8
5.txt    CR-LF*       mixed         FFFE      UTF-16
* Except for the tests under CentOS, which use single LF only.

Source used (for GNU compiler)

#include <stdio.h>
#include <wchar.h>
#include <locale.h>

int main(void)
{
    FILE *fp;
    wchar_t buf[100] = {0};
    wchar_t *line;
    long pos;

    //setlocale(LC_CTYPE, "en_GB.UTF-8"); //uncomment this for GNU+CentOS

    fp = fopen("....txt", "r"); //or "r,ccs=UTF-8"
    if (fp == NULL) {
        perror("fopen error");
        return 1;
    }
    pos = ftell(fp);

    if (fseek(fp, pos, SEEK_SET) != 0)
        perror("fseek error");
    line = fgetws(buf, 100, fp);
    pos = ftell(fp);

    if (fseek(fp, pos, SEEK_SET) != 0) //breakpoint, check result of ftell()
        perror("fseek error");
    line = fgetws(buf, 100, fp);
    pos = ftell(fp);

    fclose(fp);
    return 0;
}

Result#1: "r" mode, GNU+Win7

1.txt(single LF): pos=0, NG (really failed! @Hans Passant, @IInspectable)
2.txt(pure ANSI): pos=7, OK  
3.txt(non-ANSI): pos=13, OK(String is UTF-8 encoded)
4.txt(BOM=EFBBBF,UTF-8): pos=9, NG(BOM is also read)
5.txt(BOM=FFFE,UTF-16): pos=9, NG(BOM is also read)

Result#2: "r,ccs=UTF-8" mode, GNU+Win7, with/without setlocale()

1.txt(single LF): pos=-3!, NG (1st line can be read, UTF-16="\0x31\0xa")
2.txt(pure ANSI): pos=0, NG (1st line can be read, UTF-16=L"1abcd\n")
3.txt(non-ANSI): pos=8, NG (1st line can be read as UTF-16, but 2nd line is incorrect!)
4.txt(BOM=EFBBBF,UTF-8): pos=9, OK! (BOM ignored, string is UTF-16 = "\0x31\0x4f60\0xa". 2nd line is "\0x32\0x597d")
5.txt(BOM=FFFE,UTF-16): pos=10, OK! (BOM ignored, string is UTF-16 = "\0x31\0x4f60\0xa". 2nd line is "\0x32\0x597d")

Result#3: "r,ccs=UTF-8" mode, GNU+CentOS, WITH setlocale()

1.txt(single LF): pos=2, OK
2.txt(pure ANSI): pos=6, OK
3.txt(non-ANSI): pos=12, OK
4.txt(BOM=EFBBBF,UTF-8): not tested
5.txt(BOM=FFFE,UTF-16): not tested

Conclusion

  • For GNU+CentOS, if (and only if) setlocale() is called, ftell() works perfectly. I guess that's because a single-LF line ending is the standard on Unix.
  • For Windows, however, if you use either single-LF line endings or "ccs=UTF-8" mode, ftell() gives you an inaccurate return value without any warning. setlocale() makes no difference here. However, UTF-8/UTF-16 files with a BOM are handled perfectly, which suggests that ftell() may be able to cope with variable-length encodings after all.
    Finally, as mentioned before, "r" mode (with files that follow the CR+LF line-ending rule) will "save the world".
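One alternative that was not part of the tests above is to sidestep text-mode translation altogether. The sketch below assumes the file is opened in binary mode ("rb"), where ftell() is defined as a plain byte count from the start of the file, and records where each line begins. The name `line_offsets` is hypothetical, and lines are assumed to be shorter than the 256-byte buffer.

```c
#include <stdio.h>

/* Stores the byte offset at which each line starts (hypothetical helper).
   Binary mode disables newline translation, so ftell() reports plain byte
   offsets. Assumes every line fits in buf; a longer line would be counted
   as several lines. Returns the number of offsets stored, or -1 if the
   file cannot be opened. */
int line_offsets(const char *path, long *offsets, int cap)
{
    char buf[256];
    FILE *fp = fopen(path, "rb");
    if (fp == NULL)
        return -1;
    int n = 0;
    long pos = ftell(fp);             /* 0 at the start of the file */
    while (n < cap && fgets(buf, sizeof buf, fp) != NULL) {
        offsets[n++] = pos;           /* where the line just read began */
        pos = ftell(fp);              /* where the next line begins */
    }
    fclose(fp);
    return n;
}
```

The stored offsets can later be passed to fseek(fp, offset, SEEK_SET) on a stream that is also opened in "rb" mode; the bytes read from there still have to be converted from UTF-8 separately.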

@Hans Passant, @n.m., please kindly amend the conclusion if I missed anything.

tiancheng
  • I assume all the `_t*` and `_f*` stuff is some Microsoft invention? – DevSolar Mar 24 '15 at 10:07
  • The value returned by ftell() on a text stream cannot be accurate. It uses an internal buffer to deal with the variable-length encoding. So you'll get too large a value, some of the bytes were already read and are stored in the buffer. – Hans Passant Mar 24 '15 at 10:19
  • @DevSolar: bingo, `_t*` and `_f*` are MS conventions, defined as macros in `tchar.h` to accommodate both single-byte and wide-character builds. With `_UNICODE` defined, `_wfopen_s()` and `fgetws()` are actually used. I will update the question later – tiancheng Mar 24 '15 at 10:26
  • Windows ftell is known to be flaky in utf8 mode. Are you using the newest incarnation of the toolchain? https://connect.microsoft.com/VisualStudio/feedback/details/591030 – n. m. could be an AI Mar 24 '15 at 10:34
  • @Hans Passant: Thanks! So you just explained why it worked in `"r"` mode (static encoding) but failed in `"r,ccs=UTF-8"` mode (variable-length encoding). Is the relevant source code in `ftell.c` or elsewhere? – tiancheng Mar 24 '15 at 10:38
  • @HansPassant even in text mode, fseek'ing to a previously ftell'd position should work as expected. – n. m. could be an AI Mar 24 '15 at 10:40
  • @n.m. thanks for your experienced viewpoint! I'm using _VStudio 2012_ which should be the upgraded one. maybe try it on MinGW tomorrow.. – tiancheng Mar 24 '15 at 10:58
  • No, the ccs attribute makes no difference. The same buffer is also used to deal with line-endings, \n vs \r\n. Your text file is not a normal Windows text file produced by, say, Notepad. – Hans Passant Mar 24 '15 at 11:01
  • I guess that's what you get for using non-standard line endings (see [Why is the line terminator CR+LF?](http://blogs.msdn.com/b/oldnewthing/archive/2004/03/18/91899.aspx)). – IInspectable Mar 24 '15 at 11:03
  • For me, `fseek` in a similar program fails. – n. m. could be an AI Mar 24 '15 at 19:48
