String.startsWith fails when comparing UTF-16 string to literal

Question

I have a Unicode ("Windows Notepad Unicode", i.e. UTF-16LE) text file from which I read a line like this:

    FileInputStream is = new FileInputStream(cmdFile);
    BufferedReader reader = new BufferedReader(new InputStreamReader(is, "UTF-16LE"));
    String line = reader.readLine();

Now I need to check whether line starts with a certain sequence of characters:

    if (line.startsWith("[COMMAND]")) ...

But this returns false even though the line visibly starts with this sequence of characters.
Looking at the source code of startsWith, I can see that the comparison is done character by character. As far as I have read, Java represents strings internally in this very encoding, so why does the comparison fail? And what is the correct way to compare in this case?
One thing that comes to mind is converting the String to a byte array in the required encoding and then comparing the two byte arrays, but that seems like a rather crude approach. Is there a more elegant way?

How is "[COMMAND]" string created? As written in which case that is UTF-8 or are you creating a UTF-16LE String to compare against? — Morrison Chang, Dec 23 '17 at 22:41
My code is exactly as I have written it here. So you mean Android represents Strings internally as UTF-8? The official documentation does not seem to confirm that: https://docs.oracle.com/javase/7/docs/api/java/lang/String.html — Janeks Bergs, Dec 23 '17 at 22:44
Related: https://stackoverflow.com/a/20966894/295004 Be aware you are comparing via `startsWith` two different character sets, how should it work? — Morrison Chang, Dec 23 '17 at 22:44
http://idownvotedbecau.se/nodebugging/ --- Use debugging to see the actual string read from the file. Possible cause: The UTF-16 text file starts with a [BOM](https://en.wikipedia.org/wiki/Byte_order_mark), so the first line read will start with that. The Java `Reader` classes have no special handling of BOM. See: [Beware of Byte Order Marks](http://www.javapractices.com/topic/TopicAction.do?Id=257). — Andreas, Dec 23 '17 at 22:51
@MorrisonChang He's comparing two Strings there is no problem with encoding at that point. OP is doing something wrong or has a problem caused by BOM as mentioned by Andreas. — Oleg, Dec 23 '17 at 23:05
@Andreas Of course I did debug and I see the actual string but I don't understand why a "native" Java String (which from my understanding and from Java official documentation should already be encoded in UTF-16LE encoding) does not compare well with a String I read from file in exactly the same encoding. It is a question about Java inner workings and I hoped somebody can shed some light on it. "Thanks" for the downvote. — Janeks Bergs, Dec 23 '17 at 23:06
@MorrisonChang and why character sets are different if Java documentation states "A String represents a string in the UTF-16 format" ? This is exactly what I am reading from the file! — Janeks Bergs, Dec 23 '17 at 23:08
The only reason they don't compare is that the two strings don't start with the same characters!!! If you look at the **actual characters** of the string in the `line` variable using the debugger, you'll find out, *for yourself*, why they don't compare. --- Don't know what debugger you're using. In Eclipse, you simply expand the string to see the underlying `char[]`, where you can see the characters. — Andreas, Dec 23 '17 at 23:08
*FYI:* Java Strings are arrays of `char` values. A `char` value is a UTF-16 code unit. There is no `LE` or `BE` about it, since that is entirely up to the JVM. — Andreas, Dec 23 '17 at 23:13
OK, yes, the problem was the BOM. Android Studio did not show the extra character in the tooltip as a question mark, as it usually does, which confused me. String.getBytes() revealed it. — Janeks Bergs, Dec 23 '17 at 23:16
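
A small sketch of the point made in the comments above: byte order is a property of the encoded bytes, not of the String. Once decoded with the matching charset, both byte sequences give the same String, so `startsWith` and `equals` never see LE or BE. The class and method names here are made up for illustration.

```java
import java.nio.charset.StandardCharsets;

final class EncodingVsString {
    /** Decodes "[C" from little-endian and big-endian bytes and compares the results. */
    static boolean sameAfterDecoding() {
        byte[] le = {0x5B, 0x00, 0x43, 0x00}; // "[C" encoded as UTF-16LE
        byte[] be = {0x00, 0x5B, 0x00, 0x43}; // "[C" encoded as UTF-16BE
        String fromLe = new String(le, StandardCharsets.UTF_16LE);
        String fromBe = new String(be, StandardCharsets.UTF_16BE);
        // Both decode to the same two chars, so they equal each other and the literal.
        return fromLe.equals(fromBe) && fromLe.equals("[C");
    }

    public static void main(String[] args) {
        System.out.println(sameAfterDecoding());
    }
}
```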

score 0 · Answer 1 · answered Dec 23 '17 at 22:44

You could print the chars of line one by one as integers to check how the string is actually composed. In my app I used only `BufferedReader reader = new BufferedReader(new InputStreamReader(is));` and was able to use String's split method correctly, so maybe startsWith works properly as well.
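
Printing the chars as integers could look roughly like the sketch below. `CharDump` and `dump` are hypothetical names, and the sample line with a leading U+FEFF is an assumption standing in for whatever is actually read from the file.

```java
import java.util.StringJoiner;

final class CharDump {
    /** Returns each UTF-16 code unit of s as U+XXXX, separated by spaces. */
    static String dump(String s) {
        StringJoiner out = new StringJoiner(" ");
        for (int i = 0; i < s.length(); i++) {
            out.add(String.format("U+%04X", (int) s.charAt(i)));
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Assumed sample: a line that begins with an invisible U+FEFF.
        String line = "\uFEFF[COMMAND]";
        System.out.println(dump(line));
        System.out.println(line.startsWith("[COMMAND]"));
    }
}
```

For this sample the first code unit printed is U+FEFF and the `startsWith` check yields false, which is the kind of invisible leading character a tooltip can hide.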

score 0 · Accepted Answer · answered Dec 23 '17 at 23:18

After some research, and with the help of String.getBytes(), it turned out that the problem was the byte order mark (BOM) at the start of the file. Android Studio did not show that extra character in the tooltip as a question mark, as it usually does, which confused me.
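
One possible way to deal with it, as a sketch rather than the definitive fix, is to drop a leading U+FEFF from the first line before comparing. `BomAwareRead`, `stripBom` and `readFirstLine` are hypothetical names, and the in-memory bytes stand in for the real file.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

final class BomAwareRead {
    /** Removes a single leading U+FEFF (byte order mark) if present. */
    static String stripBom(String line) {
        if (line != null && !line.isEmpty() && line.charAt(0) == '\uFEFF') {
            return line.substring(1);
        }
        return line;
    }

    /** Reads the first line of a UTF-16LE stream and strips a leading BOM. */
    static String readFirstLine(InputStream in) throws IOException {
        BufferedReader reader =
                new BufferedReader(new InputStreamReader(in, StandardCharsets.UTF_16LE));
        return stripBom(reader.readLine());
    }

    public static void main(String[] args) throws IOException {
        // Assumed sample: the bytes FF FE followed by "[COMMAND] run" in UTF-16LE.
        byte[] fileBytes = "\uFEFF[COMMAND] run\r\n".getBytes(StandardCharsets.UTF_16LE);
        String line = readFirstLine(new ByteArrayInputStream(fileBytes));
        System.out.println(line.startsWith("[COMMAND]"));
    }
}
```

On a standard JVM the `Charset` documentation says that the plain "UTF-16" charset uses an initial byte order mark to pick the byte order when decoding, while UTF-16LE and UTF-16BE keep it as a character. Decoding with "UTF-16" may therefore be an alternative to stripping by hand; I have not verified that on Android.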

answered Dec 23 '17 at 23:18 by Janeks Bergs
