
I have an input file in ANSI encoding (UNIX file format), and I convert that file to UTF-8.

Before converting to UTF-8, there is a special character like this in the input file:

»

After converting to UTF-8, it becomes this:

û

When I process my file as is, without converting to UTF-8, all special characters disappear and there is data loss as well. But when I process my file after converting it to UTF-8, all the data appears in the output file, but with the same wrong special character that I get after the UTF-8 conversion.

ANSI to UTF-8 (this could be wrong; please correct me if I'm mistaken somewhere):

FileInputStream fis = new FileInputStream("inputtextfile.txt");
InputStreamReader isr = new InputStreamReader(fis, "ISO-8859-1");
Reader in = new BufferedReader(isr);
FileOutputStream fos = new FileOutputStream("outputfile.txt");
OutputStreamWriter osw = new OutputStreamWriter(fos, "UTF-8");
Writer out = new BufferedWriter(osw);

int ch;
out.write("\uFEFF");

while ((ch = in.read()) > -1 ) {

    out.write(ch);

}

out.close();
in.close();
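For reference, the same conversion can be done more compactly with `java.nio.file` (available in Java 1.8). This is a minimal sketch, not my actual code; the file names are placeholders, and it deliberately writes no BOM, since UTF-8 does not need one:

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class AnsiToUtf8 {
    // Read the whole file as ISO-8859-1 and write it back out as UTF-8.
    // No BOM is written: UTF-8 does not require one, and some downstream
    // tools misbehave when they encounter it.
    static void convert(Path in, Path out) throws IOException {
        String text = new String(Files.readAllBytes(in), StandardCharsets.ISO_8859_1);
        Files.write(out, text.getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args) throws IOException {
        Path in = Files.createTempFile("input", ".txt");
        Path out = Files.createTempFile("output", ".txt");
        // '»' is the single byte 0xBB in ISO-8859-1
        Files.write(in, "price » 100".getBytes(StandardCharsets.ISO_8859_1));
        convert(in, out);
        String result = new String(Files.readAllBytes(out), StandardCharsets.UTF_8);
        System.out.println(result); // the » character survives the round trip
    }
}
```

Using `StandardCharsets` constants instead of charset-name strings also removes the risk of a typo in the encoding name causing an `UnsupportedEncodingException` at runtime.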

After this, I process the file further for the final output. I'm using the Talend ETL tool (a Java-based ETL tool) to create the final output from the generated UTF-8 file.

What I want is to process my file so that I get the same special characters in the output as in the input file.

I'm using Java 1.8 for this whole process. I'm stuck in this situation and have never dealt with special characters before.

Any suggestion would be helpful.
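One thing I suspect matters: whatever reads the converted file downstream must also decode it as UTF-8, and must not treat a leading BOM as data. A minimal sketch of such a read (file name and helper are assumptions, not my real pipeline):

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class ReadUtf8 {
    // Decode the file explicitly as UTF-8 and strip a leading BOM
    // (U+FEFF) if present, so downstream processing never sees it.
    static String read(Path file) throws IOException {
        String text = new String(Files.readAllBytes(file), StandardCharsets.UTF_8);
        return text.startsWith("\uFEFF") ? text.substring(1) : text;
    }

    public static void main(String[] args) throws IOException {
        Path p = Files.createTempFile("utf8", ".txt");
        // Simulate a UTF-8 file that was written with a BOM
        Files.write(p, "\uFEFF»".getBytes(StandardCharsets.UTF_8));
        System.out.println(read(p)); // the » without the BOM
    }
}
```

If the downstream tool instead decodes the file with the platform default charset (often not UTF-8), the `»` byte sequence `0xC2 0xBB` will be mis-rendered, which matches the symptom described above.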

  • You need to read the file with UTF-8 encoding. What does your read code look like? Also, the characters after conversion to UTF-8 look wrong. How are you doing the conversion to UTF-8? – Ted Hopp May 16 '17 at 17:49
  • Could you post your code snippet for reading your ANSI encoding file? – Yohannes Gebremariam May 16 '17 at 17:54
  • @TedHopp I've posted my code of conversion – Ashish Jangra May 16 '17 at 18:00
  • 1
    If the original file is truly in ISO-8859-1, aka ANSI, then the `»` character is byte `0xBB`. That character is encoded in UTF-8 as `0xC2 0xBB`, which will display as `»`, not `û`, in ISO-8859-1. --- But you're then talking about some "processing" going wrong, without showing any part of that processing, so how do you expect us to help you figure out what is wrong with the processing, or your interpretation of the result? – Andreas May 16 '17 at 18:07
  • 1
    “After converting to UTF-8, it becomes like this: `û`” How are you observing this? What tool are you using to examine the UTF-8 file? – VGR May 16 '17 at 18:14
  • @VGR I'm checking this in the file that is generated after converting to UTF-8. – Ashish Jangra May 16 '17 at 18:17
  • Not sure if it's relevant, but you should not be generating a BYTE ORDER MARK character (U+FEFF) at the start of your output. That's appropriate for UTF-16, but not UTF-8. Other than that, your code looks like a correct way to transform from ISO 8859-1 to UTF-8. You aren't showing us how the file is being processed downstream, so it's hard to say what's going wrong. (The BOM shouldn't mess things up, but sometimes it does, particularly if the downstream processing is trying to auto-detect the encoding. It may be treating everything as UTF-16.) – Ted Hopp May 16 '17 at 18:26
  • @TedHopp Tried without the BOM as well. But if I do it like that, I get the same output as when I process the file without converting to UTF-8: no special characters in the output, and data loss as well. – Ashish Jangra May 16 '17 at 18:34

0 Answers