I am extracting data from an XML document using Axiom.
But I'm getting the above error because the XML contains CTRL-CHARs (e.g. â, €, ¢, “, ”, ™, ’, –).
Can anybody help me replace all the CTRL-CHARs to avoid this error?

ironwood
  • The CTRL-CHAR doesn't refer to those characters you've listed, but to non-printable control characters below U+0020 which (with a few exceptions, notably CR, LF and tab) are not allowed in XML 1.0 documents. If your source documents contain such characters then they're not well-formed XML. – Ian Roberts Sep 14 '12 at 12:03
  • @Ian: Yes, but isn't the exception referring to them as CTRL-CHAR? When I simply replace the detected characters one after another, it works fine. But I need a handy and robust method for this. – ironwood Sep 14 '12 at 12:07
  • The exception says "code 15", i.e. U+000F. – Ian Roberts Sep 14 '12 at 12:52
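
To see which characters the parser is actually objecting to, something like the following can help. It is only a sketch: the class and method names (`XmlCharScanner`, `isXml10Char`, `findIllegalChars`) are made up for illustration and are not part of Axiom. The check follows the XML 1.0 `Char` production, which allows tab, LF, CR, and the ranges U+0020 to U+D7FF, U+E000 to U+FFFD and U+10000 to U+10FFFF.

```java
import java.util.ArrayList;
import java.util.List;

public class XmlCharScanner {

    /** True if the code point is allowed by the XML 1.0 Char production. */
    static boolean isXml10Char(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
                || (cp >= 0x20 && cp <= 0xD7FF)
                || (cp >= 0xE000 && cp <= 0xFFFD)
                || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    /** Returns one "index: U+XXXX" entry per character that XML 1.0 forbids. */
    static List<String> findIllegalChars(String text) {
        List<String> hits = new ArrayList<>();
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);          // handles surrogate pairs
            if (!isXml10Char(cp)) {
                hits.add(String.format("%d: U+%04X", i, cp));
            }
            i += Character.charCount(cp);          // advance by 1 or 2 chars
        }
        return hits;
    }
}
```

Run against the raw text before it is handed to the parser, this reports a U+000F as illegal, while characters such as â, € or ™ are not reported at all.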

1 Answer

Currently I'm using the following method for this, but I think there must be a better way.

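import java.nio.charset.StandardCharsets;

public static String removeNonUtf8CompliantCharacters(final String inString) {
    if (inString == null) return null;
    // Encode explicitly as UTF-8 so the result does not depend on the platform default charset
    byte[] byteArr = inString.getBytes(StandardCharsets.UTF_8);
    for (int i = 0; i < byteArr.length; i++) {
        byte ch = byteArr[i];
        // Java bytes are signed, so every byte of a non-ASCII character is negative and fails
        // the ch > 0x1F test. Net effect: control characters (including tab, CR and LF),
        // all bytes of non-ASCII characters, '&' and '#' are overwritten with a space.
        if (!(ch < 0x00FD && ch > 0x001F) || ch == '&' || ch == '#') {
            byteArr[i] = ' ';
        }
    }
    return new String(byteArr, StandardCharsets.UTF_8);
}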
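
A narrower alternative, if the goal is only to get rid of what XML 1.0 forbids, might be to remove just those characters and leave everything else (including â, €, `&` and `#`) untouched. The sketch below assumes the text can be cleaned as a `String` before parsing; `XmlSanitizer` and `stripInvalidXmlChars` are hypothetical names, and the character class is the complement of the XML 1.0 `Char` production.

```java
import java.util.regex.Pattern;

public class XmlSanitizer {

    // Matches any code point that is NOT tab, LF, CR, U+0020..U+D7FF,
    // U+E000..U+FFFD or U+10000..U+10FFFF (the XML 1.0 Char production).
    private static final Pattern INVALID_XML_CHARS = Pattern.compile(
            "[^\\x09\\x0A\\x0D\\x20-\\uD7FF\\uE000-\\uFFFD\\x{10000}-\\x{10FFFF}]");

    /** Returns the input with all characters that XML 1.0 forbids removed. */
    static String stripInvalidXmlChars(String text) {
        if (text == null) return null;
        return INVALID_XML_CHARS.matcher(text).replaceAll("");
    }
}
```

This does not touch numeric character references such as `&#15;` that are already written out in the markup; those would need separate handling.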
ironwood