
I have some files, generated by a script, that provide information about various computers. The .txt files are in UTF-8; however, one line in each file is in UTF-16. How should I go about reading this line from the file?

P.S. I'm trying to write a program to parse out all of these files and recompile them into one collective .csv file.

I have tried reading the file with a BufferedReader and a Scanner, but this one line is the only one I am having trouble with. Most of the code I have found online for reading UTF-16 assumes the entire file is UTF-16, which is not the case here.

//How the line looks when opened in Notepad.

S e r i a l N u m b e r     5 C G 8 X X X X X X

//How the line looks when opened in Notepad++ with "nul" values in between each character.

S e r i a l N u m b e r     

 5 C G 8 X X X X X X

My code can pick up parts of the string, but it ends up split across multiple lines, and Java doesn't recognize the characters in between each letter or number.

Remy Lebeau
Ben Combs
  • Are you saying a single file contains some text in UTF-16 *and* some text in UTF-8? If so, make this more obvious in both your title and the body of the Question. And point out this flaw to the publisher of the data. – Basil Bourque Jun 07 '19 at 22:30
  • Open a handle/stream/whatever to the file and read from it as UTF-8 until you reach the problematic line. Then open a second handle/stream/whatever to the same file, seek it to the byte offset of the problematic line, and read the line as UTF-16. Then seek the first handle/stream to the byte offset after the line, and continue reading the rest of the file as UTF-8. – Remy Lebeau Jun 08 '19 at 01:23
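The approach in the comment above (decode everything as UTF-8 except the byte range of the odd line, which is decoded as UTF-16) could be sketched roughly as follows. This is only a sketch under several assumptions that the question does not confirm: the UTF-16 part is little-endian with no byte order mark, it forms one contiguous run, and it contains only ASCII-range characters, so every second byte in it is 0x00 and the run can be located by its NUL bytes. The class and method names (`MixedEncodingReader`, `decode`) are hypothetical.

```java
import java.nio.charset.StandardCharsets;

public class MixedEncodingReader {

    /**
     * Decodes a buffer that is UTF-8 except for one contiguous UTF-16LE run.
     * Assumption: the run holds only ASCII-range characters, so it can be
     * located by the 0x00 bytes that follow each of its characters.
     */
    public static String decode(byte[] data) {
        int firstNul = -1;
        int lastNul = -1;
        for (int i = 0; i < data.length; i++) {
            if (data[i] == 0) {
                if (firstNul < 0) {
                    firstNul = i;
                }
                lastNul = i;
            }
        }
        if (firstNul < 1) {
            // No usable UTF-16 run found: treat the whole buffer as UTF-8.
            return new String(data, StandardCharsets.UTF_8);
        }

        // The run starts one byte before the first NUL (the low byte of the
        // first character) and ends right after the last NUL.
        int start = firstNul - 1;
        int end = lastNul + 1;

        String before = new String(data, 0, start, StandardCharsets.UTF_8);
        String middle = new String(data, start, end - start, StandardCharsets.UTF_16LE);
        String after = new String(data, end, data.length - end, StandardCharsets.UTF_8);
        return before + middle + after;
    }
}
```

The bytes could come from something like `Files.readAllBytes(...)` on one of the generated files. Working on the raw bytes avoids having to keep two streams in sync, at the cost of loading the whole file into memory, which is probably acceptable for small report files.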

1 Answer


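You can try it like this:

import java.io.*;
import java.nio.charset.StandardCharsets;

File infile = new File("/someFileInutf16.txt");
FileInputStream inputStream = new FileInputStream(infile);
// Decodes the whole stream as UTF-16, so this assumes the entire file is UTF-16.
Reader in = new InputStreamReader(inputStream, StandardCharsets.UTF_16);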
Sambit
  • I'll give that a try. Thanks. – Ben Combs Jun 07 '19 at 16:00
  • I gave this a try. I'm sure I've messed something up, but I set up a simple test using the code you suggested to see how it works. My output just prints a bunch of "?" and "†" characters. Any suggestions for what could be wrong? I haven't used InputStreamReader before. – Ben Combs Jun 07 '19 at 18:52
  • If you are using an IDE like Eclipse or IntelliJ, change the character set. – Sambit Jun 07 '19 at 18:53
  • I tried changing the character encoding to each of the options; none of them works as expected. I'm using Eclipse, by the way. I went to Window -> Preferences -> General -> Workspace and changed the option there, as well as the default encoding option in Content Types. Were these the settings you mentioned? – Ben Combs Jun 07 '19 at 19:27
  • Why would this actually help? Where is the explanation? If the file is UTF-8 and only one line is UTF-16, opening a stream in UTF-16 mode will not properly decode the rest of the lines. – Fureeish Jun 07 '19 at 22:55
  • After doing some more testing, I found I was wrong about the encoding of the text. It turns out the entire text file is in UTF-8. I am able to read the file in its entirety, but I cannot get a String from the lines I need. The "nul" characters that show up when I open the file in Notepad++ seem to prevent me from doing this. I used a Scanner to read the file and set it to UTF-8 encoding. I'm not sure how to capture this data if all I am able to do with it is print it to the console. Do I need to convert the entire file to another encoding, or does anyone know of another way? – Ben Combs Jun 10 '19 at 14:41
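
For the situation described in the last comment (text read as UTF-8 that still contains "nul" characters between the letters), one possible workaround is to strip the NUL characters after decoding and drop the empty fragments they leave behind. The following is a minimal sketch, not a confirmed fix for the asker's files: it assumes the affected text is ASCII-range only, since removing NULs from real non-ASCII UTF-16 content would corrupt it. The names `NulStripper`, `stripNuls` and `cleanLines` are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

public class NulStripper {

    /** Removes every NUL (U+0000) character from a decoded line. */
    public static String stripNuls(String line) {
        return line.replace("\u0000", "");
    }

    /**
     * Splits already-decoded text into lines, strips NUL characters and
     * surrounding whitespace from each, and skips lines that end up empty.
     * A UTF-16 line break read as UTF-8 contains NULs between CR and LF,
     * which is why the affected text can appear spread over several lines.
     */
    public static List<String> cleanLines(String text) {
        List<String> lines = new ArrayList<>();
        for (String raw : text.split("\\R")) {
            String cleaned = stripNuls(raw).trim();
            if (!cleaned.isEmpty()) {
                lines.add(cleaned);
            }
        }
        return lines;
    }
}
```

Each cleaned line is then an ordinary String, so a line such as the serial number entry can be split into a label and a value and written to the .csv like any other field.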