
I have an input file in XML format; it is well formed, with the accents written correctly. The file is created with a PHP script that works fine. But when I read the XML file and write it to another XML file using a Java program, it puts strange characters in place of the accented characters.

This is the method that reads the XML File:

public static String getArchivo(FileInputStream fileinputstream)
{
    String s = null;
    try
    {
        byte abyte0[] = new byte[1024];
        int i = fileinputstream.read(abyte0);
        if(i != -1)
        {
            s = new String(abyte0, 0, i);
            for(int j = fileinputstream.read(abyte0); j != -1; j = fileinputstream.read(abyte0))
            {
                s = s + new String(abyte0, 0, j);
            }

        }
    }
    catch(IOException ioexception)
    {
        s = null;
    }
    return s;
}

Since the file is read byte by byte: how do I replace the "bad" bytes with the correct bytes for the accented characters? And if reading files like these byte by byte is not a good idea, what is a better way to do it?

The characters that I need are: á, é, í, ó, ú, Á, É, Í, Ó, Ú, ñ, Ñ and °.

Thanks in advance

mrcoar
  • If reading UTF (or any multibyte character encoding), the code will break no matter what, because it relies on converting an arbitrarily long byte array to chars, which may split a single char's bytes across buffer boundaries. – GPI Oct 06 '15 at 15:55
  • In that case, what is the best way to do this? – mrcoar Oct 06 '15 at 16:44
  • http://stackoverflow.com/q/28969941/2131074 – GPI Oct 06 '15 at 21:33
  • See above link, and the answer below. Usually the use of an `InputStreamReader` wrapping your `InputStream` and using the appropriate encoding is the way to go. The internals of the reader will do proper boundary detections and avoid decoding partial chars, which your current code might be doing. – GPI Oct 06 '15 at 21:40
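Following the comments, here is a minimal sketch of the `InputStreamReader` approach (the class name is illustrative, and the file is assumed to be UTF-8; match the charset to the XML prolog). The Reader handles multi-byte sequences that span buffer boundaries, which the byte-by-byte version cannot do:

```java
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;

public class ReadXml {
    // Reads the whole file as text. The Reader decodes multi-byte
    // sequences correctly even when they straddle buffer boundaries.
    public static String getArchivo(FileInputStream fis) throws IOException {
        StringBuilder sb = new StringBuilder();
        try (Reader r = new InputStreamReader(fis, StandardCharsets.UTF_8)) {
            char[] buf = new char[1024];
            int n;
            while ((n = r.read(buf)) != -1) {
                sb.append(buf, 0, n);
            }
        }
        return sb.toString();
    }
}
```

If the encoding is not known in advance, passing the raw `InputStream` to an XML parser via `InputSource` lets the parser detect the encoding from the prolog, as GPI notes above.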

3 Answers


Probably you are reading the file with UTF-8 charset. Special chars are not part of the UTF-8 charset. Change from UTF-8 to UTF-16

Something like

InputStream in = ...
InputSource is = new InputSource(new InputStreamReader(in, "utf-16")); 

As Jordi correctly said, none of these characters are outside of UTF-8, so treat the first part as information for special chars that really are outside it.

Looking deeper at your code, I see that you read bytes and convert them to a String. Don't convert them: read bytes and write bytes, to be sure the data will not be changed.
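A minimal sketch of that byte-for-byte copy (class and method names are illustrative): no charset decoding happens at any point, so the output file is identical to the input.

```java
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

public class CopyBytes {
    // Copies the file verbatim. No bytes are interpreted as characters,
    // so nothing can be corrupted by a wrong charset.
    public static void copy(File src, File dst) throws IOException {
        try (InputStream in = new FileInputStream(src);
             OutputStream out = new FileOutputStream(dst)) {
            byte[] buf = new byte[1024];
            int n;
            while ((n = in.read(buf)) != -1) {
                out.write(buf, 0, n);
            }
        }
    }
}
```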

Davide Lorenzo MARINO
  • *Special chars are not part of the UTF-8*.... Actually, tilded vowels are not *special chars* and are contained in the [UTF8 chartable](http://www.utf8-chartable.de/); AFAIK the entire Spanish language fits inside UTF-8 – Jordi Castilla Oct 06 '15 at 15:36
  • @Jordi Yes, I checked and the requested chars are in standard UTF-8. Thanks for your note; I was not sure about the tilded chars. – Davide Lorenzo MARINO Oct 06 '15 at 15:40
  • Glad to help @Davide, I know that because I'm Spanish :) – Jordi Castilla Oct 06 '15 at 15:48
  • Yes, in Italy (I am Italian) we have only some accents... no tildes or circumflex accents :) – Davide Lorenzo MARINO Oct 06 '15 at 15:51
  • Not agreeing with this solution: the use of UTF-16 is a possibility, albeit a rather unlikely one. Chances are the file is UTF-8 (or windows-1252), and the OP's code breaks because the byte-to-char conversion is done at a random point in the input stream. The use of a `Reader`, though, is a good suggestion :-). Chances are, too, that if the file is valid XML and passed to an `InputSource`, the underlying XML engine will do proper encoding detection based on the prolog, which mitigates the need to know, or guess, the actual encoding. – GPI Oct 06 '15 at 21:45

Works for me using charset ISO-8859-1. Syntax in Kotlin:

val inputStream: InputStream = FileInputStream(filePath)
val json = inputStream.bufferedReader(Charsets.ISO_8859_1).use { it.readText() }
Javier Hinmel

When you read the file, it is best to use UTF-8 encoding:

BufferedReader rd = new BufferedReader(new InputStreamReader(is, "utf-8"));

For writing, also use UTF-8:

OutputStreamWriter writer = new OutputStreamWriter( new FileOutputStream(filePath, true), "utf-8");

This worked for me.
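Putting the two snippets together, a sketch of the full read-then-write round trip this answer describes (class and method names are illustrative; the append flag from the writer line above is omitted here):

```java
import java.io.BufferedReader;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.StandardCharsets;

public class Utf8RoundTrip {
    // Reads the source as UTF-8 text and writes it back as UTF-8,
    // so accented characters survive the round trip unchanged.
    public static void rewrite(File src, File dst) throws IOException {
        try (BufferedReader rd = new BufferedReader(
                 new InputStreamReader(new FileInputStream(src), StandardCharsets.UTF_8));
             Writer writer = new OutputStreamWriter(
                 new FileOutputStream(dst), StandardCharsets.UTF_8)) {
            char[] buf = new char[1024];
            int n;
            while ((n = rd.read(buf)) != -1) {
                writer.write(buf, 0, n);
            }
        }
    }
}
```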

When viewing the file in vi or another editor, make sure the locale uses UTF-8: check the current charmap with `locale charmap`, and set it with `LANG=en_US.UTF-8`.

Ravi Thapa