I'm attempting to parse a UTF-16 encoded JSON file, but I've run into a weird issue.

Whenever I use a FileInputStream, parsing seems to start at the midpoint of the file. For example, if the file is 40 characters long, it will begin at character 20. This causes errors when parsing the JSON, since the data begins at character 0 in the file.

This issue cropped up the other day, after the code had worked for weeks. I can see no problem with my code, as it wasn't changed in the days before the issue started.

One of my attempted workarounds was to switch to a FileReader. It begins normally at character zero, but it cannot handle the UTF-16 characters in the document, so it does not solve the problem.

I am using Google's Gson library for handling JSON, but I think the issue lies somewhere within the InputStreamReader or the FileInputStream.

Below is the code at issue:

JsonReader reader = new JsonReader(new InputStreamReader(new FileInputStream(file), "UTF-16"));
reader.beginArray();
...

Here is the error it throws. The line reader.beginArray(); above causes the exception.

java.lang.IllegalStateException: Expected BEGIN_ARRAY but was STRING at line 1 column 21
    at com.google.gson.stream.JsonReader.expect(JsonReader.java:337)
    at com.google.gson.stream.JsonReader.beginArray(JsonReader.java:304)
    at reader.ProofDatabase.load(ProofDatabase.java:130)
    ...

And here is my partial workaround, which does not handle UTF-16 strings:

JsonReader reader = new JsonReader(new FileReader(file));
reader.beginArray();
...

Any solution, be it a fix to the original problem or another method of reading in the file as UTF-16, would be more than welcome.

Chris Salij
  • Does the JSON file begin with a Byte Order Marker? How is it generated? – Joni Feb 17 '12 at 16:25
  • The file I'm using currently was created by hand. As in I manually typed in the valid json. The previous file was generated using the `JsonWriter` class in the gson library. – Chris Salij Feb 17 '12 at 16:48
  • What editor did you use? – Joni Feb 17 '12 at 16:50
  • Could you post a hex dump of the file? On OSX you can use the command `hexdump -C yourfile`. The reason I ask is that the editor may be adding some extra characters that confuse the JSON library; however, I can't find any documentation on how TM works with UTF-16. – Joni Feb 17 '12 at 20:10
  • In your first example you set the Charset to UTF-16, whereas in the second one you rely on the default encoding. Maybe both your file and the default encoding are UTF-8 or some other 8-bit encoding. Have a look at your file in a hex viewer. – Pierre Feb 28 '12 at 09:03
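
Both hypotheses raised in the comments (a byte order mark, and a file that is not really UTF-16) can be checked from Java without a hex editor. The following is a minimal sketch and not part of the original thread: the class `EncodingCheck` and its methods `detectBom` and `decodeAsUtf16` are hypothetical names, and the sample JSON is invented for illustration. It also shows that 8-bit text decoded as UTF-16 yields half as many characters as bytes, which would be consistent with a parse error near the midpoint, though the thread does not confirm that this is what happened.

```java
import java.nio.charset.StandardCharsets;

class EncodingCheck {
    // Reports which byte order mark, if any, starts the data.
    // Only the UTF-8 and UTF-16 marks are checked in this sketch.
    static String detectBom(byte[] b) {
        if (b.length >= 3 && (b[0] & 0xFF) == 0xEF && (b[1] & 0xFF) == 0xBB && (b[2] & 0xFF) == 0xBF) {
            return "UTF-8";
        }
        if (b.length >= 2 && (b[0] & 0xFF) == 0xFE && (b[1] & 0xFF) == 0xFF) {
            return "UTF-16BE";
        }
        if (b.length >= 2 && (b[0] & 0xFF) == 0xFF && (b[1] & 0xFF) == 0xFE) {
            return "UTF-16LE";
        }
        return "none";
    }

    // Decodes the bytes the same way InputStreamReader(..., "UTF-16") would.
    static String decodeAsUtf16(byte[] b) {
        return new String(b, StandardCharsets.UTF_16);
    }

    public static void main(String[] args) {
        // Hypothetical hand-typed JSON saved in an 8-bit encoding (10 bytes).
        byte[] asciiJson = "[\"a\",\"bc\"]".getBytes(StandardCharsets.US_ASCII);

        System.out.println("BOM: " + detectBom(asciiJson));
        System.out.println("bytes: " + asciiJson.length);
        // Each pair of bytes is consumed as one UTF-16 code unit.
        System.out.println("chars when decoded as UTF-16: " + decodeAsUtf16(asciiJson).length());
    }
}
```

For a real file, the byte array could instead come from `java.nio.file.Files.readAllBytes`.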

1 Answer

Forgot to update the question when I found the solution.

The error came from the fact that I manually created the JSON file rather than programmatically generating it.

When the file was generated by the `JsonWriter` class, extra metadata was added to the file that told the parser that it was a JSON file. This metadata was missing from the manually created file, so `JsonReader` threw errors when parsing it.
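
The answer does not say what the extra metadata was. One candidate, offered here only as an assumption, is a byte order mark: Java's `UTF-16` charset writes a big-endian BOM when encoding and uses a leading BOM to pick the byte order when decoding. A minimal sketch of that behaviour follows; the class name `Utf16BomDemo`, its helper methods, and the sample text are hypothetical.

```java
import java.nio.charset.StandardCharsets;

class Utf16BomDemo {
    // Encodes with Java's "UTF-16" charset, which prepends a big-endian BOM.
    static byte[] encode(String text) {
        return text.getBytes(StandardCharsets.UTF_16);
    }

    // Decodes with the same charset; a leading BOM selects the byte order
    // and is not returned as part of the string.
    static String decode(byte[] data) {
        return new String(data, StandardCharsets.UTF_16);
    }

    // Formats bytes as space-separated upper-case hex pairs.
    static String hex(byte[] data) {
        StringBuilder sb = new StringBuilder();
        for (byte b : data) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        byte[] encoded = encode("[]");
        System.out.println(hex(encoded));    // FE FF 00 5B 00 5D
        System.out.println(decode(encoded)); // []
    }
}
```

A file typed by hand in an editor may or may not start with these two bytes, depending on how the editor saves it, which is why the comments ask for a hex dump.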

Chris Salij