As part of having fun with Avro, I discovered the following:

new String(new BigDecimal("1.28").unscaledValue().toByteArray(), Charset.forName("UTF-8"))
.equals(
new String(new BigDecimal("1.29").unscaledValue().toByteArray(), Charset.forName("UTF-8")))
-> true !!!!!!!!


DatatypeConverter.printBase64Binary(new BigDecimal("1.28").unscaledValue().toByteArray())
.equals(
DatatypeConverter.printBase64Binary(new BigDecimal("1.29").unscaledValue().toByteArray()))
-> false (as expected)

but

new String(new BigDecimal("1.26").unscaledValue().toByteArray(), Charset.forName("UTF-8"))
.equals(
new String(new BigDecimal("1.27").unscaledValue().toByteArray(), Charset.forName("UTF-8")))
-> false (as expected)

Can someone explain to me what is going on? It seems like 1.27 is the cutoff. Ideally, I need

new String(new BigDecimal("1.28").unscaledValue().toByteArray(), Charset.forName("UTF-8"))

to work for every BigDecimal value.

Stephane Maarek
  • Do you mind just adding the output of printing those unscaled values? (too lazy this morning to run it myself ;-) – GhostCat Apr 26 '17 at 07:08

1 Answer

Can someone explain to me what is going on?

Yes, you're misusing your data. The result of BigInteger.toByteArray() (which is what unscaledValue().toByteArray() calls) is not a UTF-8-encoded representation of a string, so you shouldn't try to convert it to a string that way.

Different byte arrays can be "decoded" via UTF-8 to the same string if they contain invalid sequences. If you look at the result of new BigDecimal("1.28").unscaledValue().toByteArray() and likewise for 1.29, you'll find that they're invalid UTF-8, so both decode to strings containing the replacement character U+FFFD (often displayed as "?"). However, if you're doing this at all then you're doing it wrong.

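The two byte arrays in question are { 0x00, 0x80 } and { 0x00, 0x81 }. In each case the first byte decodes to U+0000, and the second byte is a UTF-8 continuation byte with no lead byte before it, so it is malformed and the decoder substitutes the replacement character U+FFFD. So both strings are "\0\uFFFD". By contrast, 1.26 and 1.27 give the single bytes 0x7E and 0x7F, which are valid ASCII and decode to distinct characters. That is why 1.27 looks like the cutoff.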
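
If you want to see this for yourself, here is a small sketch that prints the raw bytes and the decoded code points for the four values from the question. The class and method names (Utf8DecodeDemo, unscaledBytes, decodeAsUtf8) are made up for illustration; only the JDK calls are real.

```java
import java.math.BigDecimal;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class Utf8DecodeDemo {
    // Raw two's-complement bytes of the unscaled value, e.g. "1.28" -> 128 -> {0x00, 0x80}.
    static byte[] unscaledBytes(String decimal) {
        return new BigDecimal(decimal).unscaledValue().toByteArray();
    }

    // Decodes those raw bytes as if they were UTF-8 text (the lossy step from the question).
    static String decodeAsUtf8(String decimal) {
        return new String(unscaledBytes(decimal), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        for (String s : new String[] {"1.26", "1.27", "1.28", "1.29"}) {
            String decoded = decodeAsUtf8(s);
            // Render each decoded code point as U+XXXX so invisible characters show up.
            StringBuilder codePoints = new StringBuilder();
            decoded.codePoints().forEach(cp -> codePoints.append(String.format("U+%04X ", cp)));
            System.out.println(s
                    + " bytes=" + Arrays.toString(unscaledBytes(s))
                    + " decoded=" + codePoints.toString().trim());
        }
    }
}
```

Both 1.28 and 1.29 should come out as U+0000 U+FFFD, while 1.26 and 1.27 come out as U+007E and U+007F. Once the malformed byte has been replaced, the original value can't be recovered from the string.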

If you want to convert a BigDecimal to a string, just call toString(). If you want to represent arbitrary binary data as a string, use base64 or hex, or some similar encoding scheme designed to represent arbitrary binary data as strings. UTF-8 is designed to represent arbitrary text data as binary data.
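
As a rough sketch of the base64 route, something like the following round-trips the unscaled bytes without loss. It uses java.util.Base64 (Java 8+) rather than DatatypeConverter. The class and method names are hypothetical, and it assumes the scale is known separately by whoever decodes, since the unscaled bytes don't carry it.

```java
import java.math.BigDecimal;
import java.math.BigInteger;
import java.util.Base64;

public class DecimalBase64Demo {
    // Encodes the unscaled value's bytes as Base64 text. The scale is NOT included.
    static String encodeUnscaled(BigDecimal value) {
        return Base64.getEncoder().encodeToString(value.unscaledValue().toByteArray());
    }

    // Rebuilds the BigDecimal from Base64 text plus a scale the caller already knows (assumption).
    static BigDecimal decodeUnscaled(String base64, int scale) {
        byte[] raw = Base64.getDecoder().decode(base64);
        return new BigDecimal(new BigInteger(raw), scale);
    }

    public static void main(String[] args) {
        for (String s : new String[] {"1.26", "1.27", "1.28", "1.29", "-1.28"}) {
            BigDecimal original = new BigDecimal(s);
            String text = encodeUnscaled(original);
            BigDecimal restored = decodeUnscaled(text, original.scale());
            System.out.println(s + " -> " + text + " -> " + restored);
        }
    }
}
```

Every distinct byte array gets a distinct Base64 string, so 1.28 and 1.29 no longer collide, and negative values survive too. If all you need is a readable string, toString() and new BigDecimal(String) remain the simpler option.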

Jon Skeet
  • Thanks! Would I potentially run into this issue if I used Latin-1 encoding? My code seems to work when I use it. – Stephane Maarek Apr 26 '17 at 07:30
  • @Stephane: Yes, you'll end up with control characters that may well cause problems down the line. **This is not encoded text data. Don't treat it as if it were.** This is precisely the situation that base64 and hex are provided for. Use them. – Jon Skeet Apr 26 '17 at 07:38