Ok, so I feel like this question has been asked many times, but I am not able to find an answer. I am comparing two different files that were generated by two different programs. Of course, both programs generate the files from the same DB queries. I am running into differences like the following:

s1 = Samsung - Mobile USB Chargers

vs.

s2 = Samsung \u2013 Mobile USB Chargers

How do I convert s2 to s1 or, even better, how do I compare the two without getting a difference? Someone somewhere on the wide wide internets suggested using the StringUtils class from Apache Commons Lang, but I couldn't find anything useful in it.

Mohamed Nuur
  • Note that the first string has an ASCII hyphen (HYPHEN-MINUS), while the second has an EN-DASH. – ninjalj May 18 '11 at 22:15
  • Hmm, so what you're saying is the two strings can't be compared in any easy way other than doing some sort of lookup table? – Mohamed Nuur May 18 '11 at 23:31

2 Answers

You could fold all the characters with the Dash_Punctuation property into a single character, such as the ASCII hyphen.

This code will print true:

boolean equal = "Samsung \u2013 Mobile USB Chargers"
                    .replaceAll("\\p{Pd}", "-")
                    .equals("Samsung - Mobile USB Chargers");
System.out.println(equal);

Note that this will apply to all characters with that property (like 〰 U+3030 WAVY DASH). A comprehensive list of characters with the Dash_Punctuation (Pd) property is in UnicodeData.txt. Java 6 supports Unicode 4. See chapter 6 of the Unicode Standard for a discussion of punctuation.
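If both sides of the comparison may contain dash variants, the same folding can be applied to each string before comparing. The following is only a sketch of that idea; the class name DashFolding and its method names are made up for illustration and do not come from any library.

```java
import java.util.regex.Pattern;

public class DashFolding {
    // Matches any character whose Unicode general category is Pd (Dash_Punctuation).
    private static final Pattern DASHES = Pattern.compile("\\p{Pd}");

    /** Replaces every Pd character with the ASCII HYPHEN-MINUS (U+002D). */
    public static String foldDashes(String s) {
        return DASHES.matcher(s).replaceAll("-");
    }

    /** Compares two strings after folding the dashes in both of them. */
    public static boolean equalsIgnoringDashStyle(String a, String b) {
        return foldDashes(a).equals(foldDashes(b));
    }

    public static void main(String[] args) {
        String s1 = "Samsung - Mobile USB Chargers";
        String s2 = "Samsung \u2013 Mobile USB Chargers";
        System.out.println(s1.equals(s2));                  // prints false
        System.out.println(equalsIgnoringDashStyle(s1, s2)); // prints true
    }
}
```

Characters that look like dashes but are not in the Pd category, such as U+2212 MINUS SIGN (category Sm), are left untouched by this pattern and would need their own handling.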

McDowell
  • Very interesting. I think this brings me closer to my answer and I'll keep doing research. For now, I'll give you the accepted answer and read this unicode link you shared. – Mohamed Nuur May 19 '11 at 00:01
  • @Mohamed Nuur - I've made some corrections to my post; some dash characters mentioned in Chapter 6 (like TILDE U+007E) do not have the Pd property. – McDowell May 19 '11 at 00:14

The program that generated the first string is writing the file in ASCII, using a character substitution fallback mechanism. The second is writing the file in Unicode.

These could be compared by making a copy of the second file in ASCII using the same fallback mechanism.

The best solution would be to modify the first program so that it also uses Unicode.

(It is possible that the second file was using something other than Unicode, since some other character sets include the en dash. If so, then the best solution is to write both files in Unicode, if possible.)
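When neither program can be changed, the "same fallback mechanism" idea could be approximated on the comparing side with a lookup table. This is a rough sketch only: the class AsciiFallback, the toAscii method, and above all the contents of the table are assumptions, since the substitutions the legacy program actually performs are not known here and would have to be worked out from its output.

```java
import java.util.HashMap;
import java.util.Map;

public class AsciiFallback {
    // Hypothetical substitution table. The real entries must mirror whatever
    // the ASCII-writing program does; these are only plausible guesses.
    private static final Map<Character, String> FALLBACK = new HashMap<Character, String>();
    static {
        FALLBACK.put('\u2013', "-");  // EN DASH
        FALLBACK.put('\u2014', "-");  // EM DASH
        FALLBACK.put('\u2018', "'");  // LEFT SINGLE QUOTATION MARK
        FALLBACK.put('\u2019', "'");  // RIGHT SINGLE QUOTATION MARK
        FALLBACK.put('\u201C', "\""); // LEFT DOUBLE QUOTATION MARK
        FALLBACK.put('\u201D', "\""); // RIGHT DOUBLE QUOTATION MARK
    }

    /** Returns an ASCII-only copy of s, using the table above for non-ASCII chars. */
    public static String toAscii(String s) {
        StringBuilder out = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c < 0x80) {
                out.append(c); // already ASCII
            } else {
                String sub = FALLBACK.get(c);
                // Assumption: unmapped characters become '?'. Characters outside
                // the BMP are two chars in Java and so yield two '?' here.
                out.append(sub != null ? sub : "?");
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String s1 = "Samsung - Mobile USB Chargers";
        String s2 = "Samsung \u2013 Mobile USB Chargers";
        System.out.println(toAscii(s2).equals(s1)); // prints true
    }
}
```

Applying toAscii to each line of the Unicode file before diffing should leave only the differences that are not explained by the table, which also helps in discovering entries the table is still missing.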

Jeffrey L Whitledge
  • It's not possible to change either of the outputs. Yes, one is written in Unicode, while the other is ASCII, I believe, although I'm not 100% sure. Basically, one is a legacy C++ app while the other is a Java app, so we noticed many differences due to Unicode characters/code points. What is the best way to ignore these differences? – Mohamed Nuur May 18 '11 at 23:38