2

I am sorry if this has been asked before. I am trying to process a text file using Java. The text file is exported from MS SQL Server. When I open it in PSPad (a text editor that can show any file in hex), it tells me that my text file is in UTF-16LE. Since I am getting the file from someone else, that is quite possible.

Now my Java program is not able to deal with that format. So I wanted to know if there is any way to either convert my text file to ASCII or do some preprocessing on it. I CAN modify the file.

Any help is greatly appreciated.

Thanks.

EDIT 1

I wrote this program, but it is not working as expected. If I look at the output file in PSPad, I see each character as a 2-byte char, e.g. '2' is 3200 instead of just 32, 'M' is 4D00 instead of just 4D, etc. The editor, though, says the encoding of the output file is UTF-8. I am kind of confused here. Can anyone tell me what I am doing wrong?

    public static void main(String[] args) throws Exception {

        try {
            // Open the file that is the first
            // command line parameter
            FileInputStream fstream = new FileInputStream(
                    "input.txt");
            // Get the object of DataInputStream
            DataInputStream in = new DataInputStream(fstream);
            BufferedReader br = new BufferedReader(new InputStreamReader(in,"UTF-16LE"));
            String strLine;
            // Read File Line By Line
            while ((strLine = br.readLine()) != null) {
                // Write to the file
                writeToFile(strLine);
            }
            // Close the input stream
            in.close();
        } catch (Exception e) {// Catch exception if any
            System.err.println("Error: " + e.getMessage());
        }

        System.out.println("done.");
    }

    static public void writeToFile(String str) {
        try {
            OutputStreamWriter writer = new OutputStreamWriter(new FileOutputStream("output.txt", true), "UTF-8");
            BufferedWriter fbw = new BufferedWriter(writer);
            fbw.write(str);
            fbw.close();
        } catch (Exception e) {// Catch exception if any
            System.err.println("Error: " + e.getMessage());
        }
    }

EDIT 2

Here are the snapshots:

Input file in PSPad (a free hex viewer): [image]

Output file in PSPad: [image]

This is what I was expecting to see: [image]

Bhushan
  • Java text is UTF-16 internally (see http://java.sun.com/javase/technologies/core/basic/intl/faq.jsp#text-representation), so you may be doing something wrong in your code. – Vivek Goel May 31 '11 at 17:58
  • Show us your code and let us know how you try to process the file. – Thor May 31 '11 at 18:04
  • @Thor: The code is quite big, so I cannot post it. But what I am doing is this: it is a simple comma-delimited text file. I am extracting some fields and putting them in my database. Before that, I am processing some fields, e.g. with SimpleDateFormat, which is having trouble with UTF-16LE. – Bhushan May 31 '11 at 18:47
  • Define "not working": what happens when you execute it? – Joachim Sauer Jun 01 '11 at 13:22
  • Also: there's no reason to create a `DataInputStream` in your code, simply leave out that part and use the `FileInputStream` directly. – Joachim Sauer Jun 01 '11 at 13:22
  • @Joachim Sauer: Sorry for the incomplete description. I just edited the initial part of my edit, please have a look. What I am expecting is each char as a 1-byte char and not a 2-byte one. E.g. '2' should be seen as '32' in hex-view mode and not '3200'. – Bhushan Jun 01 '11 at 13:28
  • @Joachim Sauer: removed `DataInputStream`, still same problem. – Bhushan Jun 01 '11 at 13:32
  • @Bhushan: your code isn't perfect (for example, there's no need to continuously re-open the output file), but it should mostly work. Are you sure that it's not your editor/viewer that's to blame? Can you compare the **size** of `input.txt` and `output.txt` after running your code? (and don't forget to delete `output.txt` before running it, because you'll only ever append to it). – Joachim Sauer Jun 01 '11 at 13:36
  • @Joachim Sauer: good point about the `size`. I have already noticed that the size of the output file is nearly half that of the input file. So I too thought that it was working. But if I look at it in the editor, it's not what I was expecting. – Bhushan Jun 01 '11 at 14:04
  • 1
    @Bhushan: If the size is roughly half of the input, then it's almost certainly correctly UTF-8 encoded. Maybe your editor automatically does the conversion to UTF-16 internally (many text editors only support a single internal format). – Joachim Sauer Jun 01 '11 at 14:05
  • @Joachim Sauer: Another important thing. If I supply my `input file` as input to my program which is doing the field extraction and db insertion, it fails. But if I give output file of the above program as input, it works fine! This is totally unexpected. So basically somehow this program is working, but I am still not sure how. – Bhushan Jun 01 '11 at 14:07
  • 1
    @Bhushan: how is that unexpected? Your program expects UTF-8 encoded text. If you pass in UTF-8 encoded text, it works. If you pass in non-UTF-8 encoded text it doesn't work. – Joachim Sauer Jun 01 '11 at 14:09
  • @Joachim Sauer: What you are saying is correct. But if I view the output file in the editor, it looks exactly like the input file. I am saying this wasn't expected. I will attach the snapshots in my question. – Bhushan Jun 01 '11 at 14:43
  • 1
    @Bhushan: yes, but that's the "fault" of the editor: it autodetects the encoding of the text file (a dangerous operation, by the way). – Joachim Sauer Jun 01 '11 at 14:48
  • @Joachim Sauer: I see your point. Maybe you are right. Thanks a lot for the help! I really appreciate it! – Bhushan Jun 01 '11 at 15:22
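
The size argument from the comments can be checked on a single string. The following is only an illustrative sketch (the class and method names are made up for this example): it encodes the same text in both encodings and prints the bytes in hex, which is what a hex viewer should show if it does not convert the file internally.

```java
import java.nio.charset.StandardCharsets;

class EncodedSizeDemo {
    // Number of bytes the text occupies when encoded as UTF-16LE (no BOM is added).
    static int utf16leSize(String text) {
        return text.getBytes(StandardCharsets.UTF_16LE).length;
    }

    // Number of bytes the text occupies when encoded as UTF-8.
    static int utf8Size(String text) {
        return text.getBytes(StandardCharsets.UTF_8).length;
    }

    // Renders bytes as upper-case hex, two digits per byte.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02X", b));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String sample = "2M";
        System.out.println(hex(sample.getBytes(StandardCharsets.UTF_16LE))); // 32004D00
        System.out.println(hex(sample.getBytes(StandardCharsets.UTF_8)));    // 324D
    }
}
```

For plain ASCII text such as '2' and 'M', UTF-8 needs one byte per character and UTF-16LE needs two, so an output file of roughly half the input size is consistent with a correct conversion.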

3 Answers

6

Create an InputStreamReader for charset UTF-16LE and you will be all set.
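
A minimal sketch of what that could look like, assuming the file is line-oriented and may start with a byte order mark. The class name `Utf16LeLines` and the method `forEachLine` are hypothetical, chosen for this example only. Java's `UTF-16LE` decoder does not strip a leading BOM, so the sketch removes it from the first line by hand.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.function.Consumer;

class Utf16LeLines {
    // Decodes the file as UTF-16LE and passes each line (without its
    // line terminator) to the given handler, one line at a time.
    static void forEachLine(Path file, Consumer<String> handler) throws IOException {
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(Files.newInputStream(file), StandardCharsets.UTF_16LE))) {
            String line;
            boolean first = true;
            while ((line = reader.readLine()) != null) {
                // The UTF-16LE decoder keeps a leading BOM as U+FEFF, so drop it here.
                if (first && line.startsWith("\uFEFF")) {
                    line = line.substring(1);
                }
                first = false;
                handler.accept(line);
            }
        }
    }
}
```

Because lines are handed over one at a time, this should also be usable for large files; the existing field-extraction logic could be called from the handler instead of reading a converted file.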

bmargulies
1

InputStreamReader will let you load your UTF-16LE file into memory. You can then perform all the string manipulation you need and save the result in ASCII format using OutputStreamWriter. Use Charset to select the encodings.
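
As a rough sketch of that read-then-write approach, the following copies a file from UTF-16LE to a target charset in fixed-size chunks, so the whole file never has to fit in memory. The class name `Utf16LeConverter` and the method `convert` are hypothetical. It assumes the input really is UTF-16LE, and it drops a leading byte order mark if one is present. With `StandardCharsets.US_ASCII` as the target, characters that cannot be represented are replaced with `?` by the writer.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.Reader;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

class Utf16LeConverter {
    // Re-encodes a UTF-16LE file into the target charset, keeping line
    // terminators as they are. The output file is created or overwritten.
    static void convert(Path in, Path out, Charset target) throws IOException {
        try (Reader reader = new BufferedReader(
                     new InputStreamReader(Files.newInputStream(in), StandardCharsets.UTF_16LE));
             Writer writer = new BufferedWriter(
                     new OutputStreamWriter(Files.newOutputStream(out), target))) {
            char[] buf = new char[8192];
            boolean atStart = true;
            int n;
            while ((n = reader.read(buf)) != -1) {
                int off = 0;
                if (n > 0) {
                    // Skip a leading BOM, which the UTF-16LE decoder passes through as U+FEFF.
                    if (atStart && buf[0] == '\uFEFF') {
                        off = 1;
                    }
                    atStart = false;
                }
                writer.write(buf, off, n - off);
            }
        }
    }
}
```

A call such as `Utf16LeConverter.convert(Paths.get("input.txt"), Paths.get("output.txt"), StandardCharsets.UTF_8)` would mirror the file names used in the question. Opening the output once, rather than once per line, also avoids the repeated append-mode opens in the question's code.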

Jérôme Verstrynge
0

Just found a solution.

http://www.fileformat.info/convert/text/utf2utf.htm

It lets you upload a file and convert between the encodings.

It's not a permanent solution though, since my file is 700MB+. So I will try out some of the solutions posted by others.

This small software helps:

http://www.kalytta.com/tools.php

Bhushan