
I'm using a `Scanner` to read a file. What I don't understand is this: if the file is UTF-8 encoded and the line currently being read starts with a digit, `Character.isDigit(line.charAt(0))` returns false. If the file is not UTF-8 encoded, the same call returns true.

Here's some code:

File theFile = new File(pathToFile);
Scanner fileContent = new Scanner(new FileInputStream(theFile), "UTF-8");
while (fileContent.hasNextLine())
{
    String line = fileContent.nextLine();
    if (Character.isDigit(line.charAt(0)))
    {
        // When the file being read is NOT a UTF-8 file, we get down here
    }
}

When I use the debugger and look at the `line` string, it appears to hold the same thing in both cases (UTF-8 file or not): a digit. Why is this happening?

David Conrad
Daniel Jørgensen
  • Did you debug it? What character does `line.charAt(0)` return when it's not doing what you expect? – Jesper Mar 04 '15 at 20:47
  • `line.charAt(0)` does not return anything when using a UTF-8 file, which of course explains why `Character.isDigit` does not return true. But why does `line.charAt(0)` not return anything? – Daniel Jørgensen Mar 04 '15 at 20:52
  • It is not happening. That is, the `String` you obtain from reading one file is not the same as the one you obtain from reading the other. Once you get the line into `String` form, Java doesn't know or care where the `char`s in it came from. When you debug, look at the integer values of the `char`s, not a graphic representation of them, and certainly not a graphic representation of the whole string. – John Bollinger Mar 04 '15 at 20:52
  • `String.charAt(0)` can fail to return anything only by throwing an exception. I'm having trouble imagining how that could be consistent with the strings appearing to hold the same thing when you look at them in a debugger. Moreover, it is not consistent with `Character.isDigit(line.charAt(0))` returning `false`, as you claim it does. – John Bollinger Mar 04 '15 at 20:55
  • Agreed, this doesn't make much sense. Out of curiosity, what is the encoding of the other, non-UTF-8 file, and what is the length of the string in both the successful and unsuccessful cases? – David Conrad Mar 04 '15 at 21:00
  • Can you give us a sample file this fails on? – David Conrad Mar 04 '15 at 21:05
  • Does your file (accidentally) include a BOM? – gregdim Mar 04 '15 at 21:06
  • It's really simple. Say I have a text file with only one character in it on one line, for example "1" without the quotes, and I save this file both as UTF-8 and as ANSI. The ANSI-encoded file is the only one where `line.charAt(0)` gets 1. With the UTF-8 file, `line.charAt(0)` returns blank. How can I check whether the file has a BOM? – Daniel Jørgensen Mar 04 '15 at 21:19
  • Use a hex editor and check for leading EF BB BF – gregdim Mar 04 '15 at 21:20
  • Could you do `hexdump` on the file? Debugging encoding issues without that is almost impossible. – David Ehrmann Mar 04 '15 at 21:21
  • This is indeed the case. The UTF-8 encoded file has a BOM: the hex editor shows that the single character in the file is preceded by three extra bytes. [Screendump](http://gyazo.com/d07af47232055850f7f22a9dce0e0a02) Is there anything that can be done about this? – Daniel Jørgensen Mar 04 '15 at 21:25
  • "Is there anything that can be done about this?" You can strip the BOM from the file. That doesn't make it any less UTF-8 encoded, but it might interfere with some programs *recognizing* that it is UTF-8 encoded. Alternatively, your program could check for a BOM as the first character of a UTF-8 (or UTF-16BE or UTF-16LE) file, and ignore it. Perhaps it's even reasonable to ignore a BOM as the first character of *any* file. That depends on your program. – John Bollinger Mar 04 '15 at 21:38

1 Answer


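As we finally established in the comments, your file includes a BOM (byte order mark). This is generally not recommended for UTF-8 files: Java's UTF-8 decoder does not strip it, so it reaches your program as data, namely the character U+FEFF at the start of the first line. That character is not a digit, which is why `Character.isDigit(line.charAt(0))` returns false.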
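To see the effect in isolation, here is a minimal sketch that decodes the bytes of a one-character file with and without a BOM. The class name `BomDemo` and the helper `decode` are made up for illustration; the byte values are the `EF BB BF` prefix mentioned in the comments followed by the digit `1`.

```java
import java.nio.charset.StandardCharsets;

public class BomDemo {
    // Hypothetical helper: decodes raw bytes as UTF-8, the same charset
    // the Scanner in the question is configured with.
    static String decode(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // "1" saved as UTF-8 with a BOM: EF BB BF followed by 0x31
        byte[] withBom = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, 0x31};
        // "1" saved without a BOM
        byte[] withoutBom = {0x31};

        String a = decode(withBom);
        String b = decode(withoutBom);

        System.out.println(a.length());                       // 2
        System.out.println(Integer.toHexString(a.charAt(0))); // feff
        System.out.println(Character.isDigit(a.charAt(0)));   // false
        System.out.println(Character.isDigit(b.charAt(0)));   // true
    }
}
```

U+FEFF has no visible glyph, which would explain why both strings look the same in a debugger's string view.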

So you have two options:

  1. If you control the file, regenerate it without the BOM.

  2. If not, check the file for a BOM and remove it before any other processing.

Here is some code to start with. It skips the BOM while reading rather than removing it from the file. Feel free to modify it as you like. It comes from a test utility I wrote some years ago:

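import java.io.BufferedInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.PushbackInputStream;

private static InputStream filterBOMifExists(InputStream inputStream) throws IOException {
    PushbackInputStream pushbackInputStream = new PushbackInputStream(new BufferedInputStream(inputStream), 3);
    byte[] bom = new byte[3];
    // read() may return fewer than 3 bytes (e.g. a one-byte file), so keep the count
    int read = pushbackInputStream.read(bom, 0, bom.length);
    if (read > 0) {
        boolean hasBom = read == 3 && bom[0] == (byte) 0xEF && bom[1] == (byte) 0xBB && bom[2] == (byte) 0xBF;
        if (!hasBom) {
            // Not a UTF-8 BOM: push back exactly the bytes that were read
            pushbackInputStream.unread(bom, 0, read);
        }
    }
    return pushbackInputStream;
}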
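As an alternative to filtering at the byte level, the BOM can also be dropped after decoding, by checking the first line for a leading U+FEFF. The following is only a sketch under that assumption: the class `FirstLineBomStrip` and the helper `stripLeadingBom` are hypothetical names, and a `ByteArrayInputStream` stands in for the real file.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.util.Scanner;

public class FirstLineBomStrip {
    // Hypothetical helper: drops a single leading U+FEFF, if present.
    static String stripLeadingBom(String line) {
        if (!line.isEmpty() && line.charAt(0) == '\uFEFF') {
            return line.substring(1);
        }
        return line;
    }

    public static void main(String[] args) {
        // Stand-in for a UTF-8 file with a BOM and two lines: "1" and "2"
        byte[] data = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, '1', '\n', '2', '\n'};
        InputStream in = new ByteArrayInputStream(data);

        try (Scanner fileContent = new Scanner(in, "UTF-8")) {
            boolean firstLine = true;
            while (fileContent.hasNextLine()) {
                String line = fileContent.nextLine();
                if (firstLine) {
                    // Only the very first line can carry the BOM
                    line = stripLeadingBom(line);
                    firstLine = false;
                }
                // Guard against empty lines before calling charAt(0)
                if (!line.isEmpty() && Character.isDigit(line.charAt(0))) {
                    System.out.println("digit line: " + line);
                }
            }
        }
    }
}
```

This variant keeps the `Scanner` setup from the question unchanged, at the cost of one extra check on the first line. Which of the two approaches fits better depends on whether other code also reads the same stream.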
gregdim