5

How can I convert an int array that holds a UTF-8 encoded string into a StringBuilder in a while loop? For example:
int array: 71, 73, 70, 56, 57, 97, 149, 0, 55, 0, 247...
resulting string: GIF89a• €÷€ € €€ÀÜÀ¦Êð*?ª*?ÿ...
The string contains Latin, Cyrillic and Asian characters, as well as various symbols and digits.

    do buffer.append((char)num[++i]);
    while((byte)buffer.charAt(buffer.length()-1) != -1);

This approach garbles every non-Latin character.

Dmitriy

2 Answers

3

First, convert the int[] to a byte[] as follows:

    //intArray contains your data...
    byte[] utf8bytes = new byte[intArray.length];
    for(int i = 0; i < intArray.length; i++)
    {
        utf8bytes[i] = (byte) intArray[i];
    }

Then create a string from your bytes specifying UTF-8 as the encoding:

    String asString = new String(utf8bytes, "UTF-8");
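
The two steps above can be folded into one helper that returns a StringBuilder, which is what the question asks for. This is only a sketch: the class name `Utf8IntArrayDemo` and the method `decode` are made up for illustration, and it assumes every int holds exactly one UTF-8 byte (0 to 255). It uses `StandardCharsets.UTF_8` (Java 7+) so no checked `UnsupportedEncodingException` has to be handled.

```java
import java.nio.charset.StandardCharsets;

public class Utf8IntArrayDemo {

    // Hypothetical helper: assumes each int carries exactly one UTF-8 byte (0..255).
    static StringBuilder decode(int[] values) {
        byte[] utf8bytes = new byte[values.length];
        for (int i = 0; i < values.length; i++) {
            utf8bytes[i] = (byte) values[i]; // the cast keeps the low 8 bits
        }
        // Decode the whole sequence at once so multi-byte characters stay intact,
        // instead of casting each byte to a char on its own.
        return new StringBuilder(new String(utf8bytes, StandardCharsets.UTF_8));
    }

    public static void main(String[] args) {
        // UTF-8 bytes of a six-letter Russian word, two bytes per Cyrillic letter
        int[] data = {208, 159, 209, 128, 208, 184, 208, 178, 208, 181, 209, 130};
        StringBuilder buffer = decode(data);
        System.out.println(buffer.length()); // prints 6 (six chars from twelve bytes)
    }
}
```

Appending `(char) num[i]` one element at a time cannot work for non-Latin text, because a Cyrillic or Asian character occupies several bytes that must be decoded together.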
Malcolm Smith
  • Does each int contain 1 byte instead of 4? – Dmitriy Jun 07 '12 at 20:47
  • From your (admittedly small) selection of example values, it looked like you were dealing with an array of ints < 256, which can simply be cast to bytes. If you did have 4 bytes packed into each int, most of them would have very large absolute values. In that case you could unpack them into separate bytes using bit masks and logical shifts. – Malcolm Smith Jun 07 '12 at 20:57
  • utf8bytes[0] = (byte)(intArray[i] >>> 24); utf8bytes[1] = (byte)(intArray[i] >>> 16); utf8bytes[2] = (byte)(intArray[i] >>> 8); utf8bytes[3] = (byte)intArray[i]; With this, 3 space characters are added after each Latin character and 2 after each Cyrillic character. – Dmitriy Jun 07 '12 at 21:00
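
The last comment reports three padding characters after each Latin letter and two after each Cyrillic one. That would be consistent with each int carrying one character's UTF-8 bytes in its low-order bytes and zeros above them, but the thread never confirms this layout, so treat it as an assumption. Under that assumption, a sketch that unpacks each int and drops the leading zero bytes could look like this (`PackedUtf8Demo` and `decodePacked` are hypothetical names):

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

public class PackedUtf8Demo {

    // Assumption (not confirmed by the thread): each int packs the UTF-8 bytes of
    // one character, big-endian, with the unused high-order bytes set to zero.
    static String decodePacked(int[] values) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (int value : values) {
            boolean started = false;
            for (int shift = 24; shift >= 0; shift -= 8) {
                int b = (value >>> shift) & 0xFF;
                // Skip leading zero padding, but always keep the lowest byte
                // so that a genuine 0 value is not lost.
                if (b != 0 || started || shift == 0) {
                    out.write(b);
                    started = true;
                }
            }
        }
        return new String(out.toByteArray(), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // 0x47 is 'G' (one byte); 0xD09F is a two-byte Cyrillic letter
        String text = decodePacked(new int[]{0x47, 0xD09F});
        System.out.println(text.length()); // prints 2
    }
}
```

Note that the snippet in the comment also writes every int to indices 0 to 3 of the output array; any fixed-width unpacking would need to write to `i * 4` onward instead.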
0

You are reading in a GIF89a file as one integer per byte, and then printing it out as if it were a text string. The main problem is that the integers (bytes) within that file do not actually map to meaningful text characters, so where the mapping fails to render portions of the alphabet, it will render whatever your text encoding dictates (which looks to me like a lot of garbage).

Graphical information does not always map cleanly to text. There are 256 possible byte values, and sometimes one or more bytes represent a single character, but the English alphabet has only 26 letters, each in upper and lower case. Together with the ten digits and a handful of punctuation marks, that gives about 80 characters in common use in an essay. The remaining 160+ values are control codes, bytes that signal multi-byte sequences, or mappings to characters that support other languages.

That garbage is the closest valid byte-to-character mapping your current character set can offer. If you want better output, try reading a file whose data actually represents characters.
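
One way to check whether a byte sequence is text at all is to decode it strictly and see whether the decoder objects. The sketch below uses the standard `CharsetDecoder` with `CodingErrorAction.REPORT`; the class name `Utf8ValidityDemo` and the method `isValidUtf8` are invented for this example. Fed the first seven values from the question, it rejects them, because 149 (0x95) is a UTF-8 continuation byte with no lead byte in front of it.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class Utf8ValidityDemo {

    // Hypothetical helper: returns true only when the bytes are well-formed UTF-8.
    static boolean isValidUtf8(byte[] bytes) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)       // throw instead of substituting
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // First seven values from the question: "GIF89a" followed by 149
        byte[] gifHeader = {71, 73, 70, 56, 57, 97, (byte) 149};
        System.out.println(isValidUtf8(gifHeader)); // prints false

        // A two-byte Cyrillic letter encoded as UTF-8
        byte[] cyrillic = {(byte) 208, (byte) 159};
        System.out.println(isValidUtf8(cyrillic)); // prints true
    }
}
```

By default `new String(bytes, charset)` silently replaces malformed input with a replacement character, which is why binary data shows up as garbage rather than as an error.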

Edwin Buck
  • No, this is just an example; the program is not designed for reading files. It will work with text messages in Russian and Asian languages. – Dmitriy Jun 07 '12 at 20:33