I have an API that returns its results in a specific single-byte charset (WIN 1257), and I am reading the result in Kotlin like this:

val connection = URL("http://192.168.1.21:92/someAPI").openConnection() as HttpURLConnection
var byteArray: ByteArray = ByteArray(10000000)
connection.inputStream.read(byteArray)
val tmp = String(byteArray, Charsets.UTF_8).trim()

Of course, this code is incorrect, because it assumes that byteArray holds a UTF-8 encoded string. I would like to fix it by using something like Charsets.WIN_1257, but Kotlin's Charsets object has no such constant. My byte array holds a WIN-1257 encoded string; how can I decode it into the correct string?

Here is a simple test that isolates my problem and can be run at https://play.kotlinlang.org:

/**
 * You can edit, run, and share this code.
 * play.kotlinlang.org
 */
fun main() {
    var byteArray: ByteArray = listOf(0xe2, 0x72).map { it.toByte() }.toByteArray()
    println(String(byteArray, Charsets.UTF_8))
}

One can see that decoding as UTF_8 produces:

�r

But I expect:

ār
TomR
  • Change to `Charsets.ISO_8859_1` and the output changes to `âr`. Ok, doesn't look exactly like your expected result but avoids `�` ;-) Do you have more examples of characters that turn into `�`? – deHaar Aug 03 '22 at 09:24
  • It depends on the tool. Postman automatically applies a charset that outputs âr, but Kotlin with UTF_8 outputs � for any non-ASCII character. It is really strange that the Charsets object on Android offers so few options. – TomR Aug 03 '22 at 09:29

1 Answer

Look into Charset.availableCharsets(); simply calling Charset.forName("Windows-1257") might work.
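As a minimal sketch of that suggestion, applied to the two test bytes from the question: it assumes the runtime actually ships the windows-1257 charset (which `Charset.isSupported` can confirm), and the helper name `decodeWin1257` is made up for illustration. Note that `Charset` lives in `java.nio.charset` and needs an explicit import, unlike Kotlin's `Charsets` object.

```kotlin
import java.nio.charset.Charset

// Hypothetical helper: decodes bytes as windows-1257 (Baltic).
// Charset.forName throws UnsupportedCharsetException if the runtime lacks the charset.
fun decodeWin1257(bytes: ByteArray): String =
    String(bytes, Charset.forName("windows-1257"))

fun main() {
    // The same two bytes as in the question: 0xE2 0x72
    val byteArray = byteArrayOf(0xe2.toByte(), 0x72)

    // Check availability first instead of assuming it
    println(Charset.isSupported("windows-1257"))

    // The question expects "ār" here
    println(decodeWin1257(byteArray))
}
```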

Joop Eggen
  • Indeed, String(byteArray, Charset.forName("windows-1257")) solved my problem. NB: this works in Android Studio and its emulators, but https://play.kotlinlang.org gives "Unresolved reference: Charset". Of course, I am concerned with what happens on the device, not in the online playground. – TomR Aug 03 '22 at 09:41
  • `Charset` is pure java, and `Charsets.UTF_8` also is a `Charset`. So weird. – Joop Eggen Aug 03 '22 at 09:45
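
Putting the accepted fix back into the question's reading code might look roughly like the sketch below. It is an illustration under assumptions, not the asker's final code: the function name `readAllAsString` and its default charset argument are hypothetical, and a `ByteArrayInputStream` stands in for `connection.inputStream` so the example runs without a network. It reads the whole stream with `readBytes()` rather than a single `read` into a fixed buffer, since one `read` call is not guaranteed to return the full response.

```kotlin
import java.io.ByteArrayInputStream
import java.io.InputStream
import java.nio.charset.Charset

// Hypothetical helper: reads the stream to its end and decodes it with the named charset.
// use { } closes the stream afterwards, even if reading fails.
fun readAllAsString(input: InputStream, charsetName: String = "windows-1257"): String =
    input.use { it.readBytes().toString(Charset.forName(charsetName)) }

fun main() {
    // Stand-in for connection.inputStream, holding the question's two test bytes
    val fakeResponse = ByteArrayInputStream(byteArrayOf(0xe2.toByte(), 0x72))
    println(readAllAsString(fakeResponse))
}
```

With a real connection, the call would presumably be `readAllAsString(connection.inputStream)`.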