
I have two Strings that are frustrating me a lot. They apparently contain the same text, but when I compare them, Java says they are different.

The text is "La Coruña". One string is returned via Google Geocoder, and the other is hardcoded by me.

I've tried equals(), which returns false; equalsIgnoreCase(), which returns false; contains(), which returns false; and compareTo(), which doesn't return 0 (0 would mean they are equal).

Then I dumped both strings into byte arrays by calling getBytes("UTF-8") on each. Again, equals() returns false, and Arrays.equals(array1, array2) returns false too.

Arrays.equals() returns false when the arrays have different lengths or when the values at the same position differ. So I printed both arrays and... surprise! The content was different:
Array1 [76, 97, 32, 67, 111, 114, 117, -61, -79, 97]
Array2 [76, 97, 32, 67, 111, 114, 117, -47, -127, 97]

The question is WHY this is happening and how I can make them equal so I can successfully compare them. My guess is that Google is using some kind of encoding ("La Coruña" contains the ñ character) that differs from the one used for the hardcoded String.

Please give me some help.

Thanks in advance.

Alberto
  • These are fundamentally different strings, according to the ASCII. The first starts "La ", the second starts "A ". (One is the Spanish, the other is Galician.) – Oliver Charlesworth Dec 02 '14 at 23:33
  • @OliverCharlesworth Woah! You're right, but it was a typo made while copying from the error logs. Sorry! I've updated my question. The content really is different, but the length is the same. – Alberto Dec 02 '14 at 23:42
  • Still "La" vs "LA" imho. Don't know about the "ñ" though – realUser404 Dec 03 '14 at 00:01
  • @realUser404 Yes, thank you. It was a typo. I've been dealing with this for hours and now I don't see those small details. I don't want to waste your time, sorry. I've used an online decimal-to-text converter and the difference is only in the **ñ** part. – Alberto Dec 03 '14 at 00:11

1 Answer


The difference in the printed arrays is -61, -79 versus -47, -127 as the representation of “ñ”. The numbers are negative because Java's byte type is signed, so any byte with the high bit set prints as a negative value. Treated as unsigned, as bytes in character representations should be, they are 195, 177 vs. 209, 129 in decimal, or C3, B1 vs. D1, 81 in hexadecimal. The former is the UTF-8 representation of LATIN SMALL LETTER N WITH TILDE U+00F1. The latter makes no sense as UTF-8 here, since it decodes to a Cyrillic letter (CYRILLIC SMALL LETTER ES U+0441).
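The conversion above can be reproduced with a short sketch. The class name `ByteDump` and the helper `toHex` are made up for illustration; the byte values are the ones from the question's arrays.

```java
import java.nio.charset.StandardCharsets;

public class ByteDump {
    // Formats each byte as two unsigned hex digits, e.g. -61 becomes C3.
    // Masking with 0xFF discards the sign extension of Java's signed byte.
    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    // Decodes the bytes as UTF-8 and returns the first code point.
    static int firstCodePoint(byte[] utf8) {
        return new String(utf8, StandardCharsets.UTF_8).codePointAt(0);
    }

    public static void main(String[] args) {
        byte[] first = {-61, -79};    // the two bytes from Array1
        byte[] second = {-47, -127};  // the two bytes from Array2

        System.out.println(toHex(first));   // C3 B1
        System.out.println(toHex(second));  // D1 81

        System.out.printf("U+%04X%n", firstCodePoint(first));   // U+00F1
        System.out.printf("U+%04X%n", firstCodePoint(second));  // U+0441
    }
}
```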

Thus, the first string, apparently what you get from Google, is properly UTF-8 encoded. The other, apparently the hard-coded one, is simply in error. From the given data, it cannot be inferred where the error comes from.
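One way to check a suspect literal is to print its code points instead of its bytes. The following is only a sketch: `CodePointDump` and `codePoints` are hypothetical names, and the `suspect` string is built with an escape to stand in for a literal that the compiler decoded wrongly. Writing the expected value as `\u00F1` keeps it independent of the source file's encoding.

```java
public class CodePointDump {
    // Lists every code point of the string in U+XXXX notation.
    static String codePoints(String s) {
        StringBuilder sb = new StringBuilder();
        s.codePoints().forEach(cp -> {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("U+%04X", cp));
        });
        return sb.toString();
    }

    public static void main(String[] args) {
        // The escape \u00F1 is read the same way whatever encoding the file uses.
        String expected = "La Coru\u00F1a";
        // Assumption for illustration: simulates the mis-decoded hard-coded literal.
        String suspect = "La Coru\u0441a";

        System.out.println(codePoints(expected));
        System.out.println(codePoints(suspect));
        System.out.println(expected.equals(suspect)); // false
    }
}
```

If the literal in the source file shows a code point other than U+00F1 at that position, the file's encoding and the encoding the compiler assumes probably disagree; `javac` accepts an `-encoding` option to state the source encoding explicitly.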

Jukka K. Korpela
  • Thanks for pointing me in the right direction. You were totally right. Google's string was OK, but the hardcoded one wasn't. **All the source code files in my project had windows-1252 encoding**. I don't even know why. After changing them (one by one) to UTF-8, the arrays became identical and I could compare them successfully. THANK YOU. – Alberto Dec 04 '14 at 20:18