The big problem I am having is that users copy text straight from Word into fields, which makes the XML I am generating invalid. I have found a multitude of different approaches to this problem, but what would be considered the grooviest way of removing these invalid characters from my XML, whether they come from Word or not?
1 Answer
I wrote a Java class (7 years ago now, looking at the timestamp) with a single static method to try and clean up text posted from Word.
It's here if you're interested:
/**
 * <p>Title: Word Cleaner</p>
 * <p>Description: Strips out all of the rubbish that Word tends to generate (open, close quotes, etc)</p>
 *
 * Based on John Walker's "Demoroniser" Perl script : http://www.fourmilab.ch/webtools/demoroniser/
 */
public class WordCleaner
{
    // Utility class, not meant to be instantiated
    private WordCleaner() {}

    public static String runWordCleaner( String input )
    {
        StringBuilder sb = new StringBuilder() ;
        // A null input is treated as an empty string
        int len = input == null ? 0 : input.length() ;
        for( int i = 0 ; i < len ; i++ )
        {
            int c ;
            switch( c = (int)input.charAt( i ) )
            {
                // Windows-1252 code points (0x80 to 0xBE)
                case 0x82 : sb.append( "," ) ; break ;
                case 0x83 : sb.append( "f" ) ; break ;
                case 0x84 : sb.append( ",," ) ; break ;
                case 0x85 : sb.append( "..." ) ; break ;
                case 0x88 : sb.append( "^" ) ; break ;
                case 0x89 : sb.append( "ppt" ) ; break ;
                case 0x8B : sb.append( "<" ) ; break ;
                case 0x8C : sb.append( "Oe" ) ; break ;
                case 0x91 : sb.append( "'" ) ; break ;
                case 0x92 : sb.append( "'" ) ; break ;
                case 0x93 : sb.append( "\"" ) ; break ;
                case 0x94 : sb.append( "\"" ) ; break ;
                case 0x95 : sb.append( "*" ) ; break ;
                case 0x96 : sb.append( "-" ) ; break ;
                case 0x97 : sb.append( "--" ) ; break ;
                case 0x98 : sb.append( "~" ) ; break ;
                case 0x99 : sb.append( "TM" ) ; break ;
                case 0x9B : sb.append( ">" ) ; break ;
                case 0x9C : sb.append( "oe" ) ; break ;
                case 0xA9 : sb.append( "(c)" ) ; break ;
                case 0xAE : sb.append( "(r)" ) ; break ;
                case 0xBC : sb.append( "1/4" ) ; break ;
                case 0xBD : sb.append( "1/2" ) ; break ;
                case 0xBE : sb.append( "3/4" ) ; break ;
                // Unicode general punctuation (decimal code points)
                case 8208 : sb.append( "-" ) ; break ;
                case 8209 : sb.append( "-" ) ; break ;
                case 8211 : sb.append( "--" ) ; break ;
                case 8212 : sb.append( "--" ) ; break ;
                case 8213 : sb.append( "--" ) ; break ;
                case 8214 : sb.append( "||" ) ; break ;
                case 8215 : sb.append( "_" ) ; break ;
                case 8216 : sb.append( "'" ) ; break ;
                case 8217 : sb.append( "'" ) ; break ;
                case 8218 : sb.append( "," ) ; break ;
                case 8219 : sb.append( "'" ) ; break ;
                case 8220 : sb.append( "\"" ) ; break ;
                case 8221 : sb.append( "\"" ) ; break ;
                case 8222 : sb.append( ",," ) ; break ;
                case 8223 : sb.append( "\"" ) ; break ;
                case 8226 : sb.append( "*" ) ; break ;
                case 8227 : sb.append( ">" ) ; break ;
                case 8228 : sb.append( "*" ) ; break ;
                case 8229 : sb.append( ".." ) ; break ;
                case 8230 : sb.append( "..." ) ; break ;
                case 8231 : sb.append( "-" ) ; break ;
                // Private-use code points that Word uses for smileys
                case 61514 : sb.append( ":-)" ) ; break ;
                case 61515 : sb.append( ":-|" ) ; break ;
                case 61516 : sb.append( ":-(" ) ; break ;
                // Anything else is passed through unchanged
                default : sb.append( (char)c ) ;
            }
        }
        return sb.toString() ;
    }
}
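The class above only swaps Word's typographic characters for ASCII look-alikes. It does not remove characters that XML itself forbids, such as most control characters below 0x20. If those are part of the problem, a separate pass along the following lines might help. This is a minimal sketch, not part of the original class: the names `XmlCharFilter`, `isValidXmlChar` and `stripInvalidXmlChars` are made up for illustration, and the ranges are the ones allowed by the `Char` production of XML 1.0.

```java
public final class XmlCharFilter {
    private XmlCharFilter() {}

    /** Returns true if the code point is allowed by the XML 1.0 Char production. */
    static boolean isValidXmlChar(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
            || (cp >= 0x20 && cp <= 0xD7FF)
            || (cp >= 0xE000 && cp <= 0xFFFD)
            || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    /** Drops every code point that XML 1.0 does not allow; null becomes "". */
    public static String stripInvalidXmlChars(String input) {
        if (input == null) {
            return "";
        }
        StringBuilder sb = new StringBuilder(input.length());
        for (int i = 0; i < input.length(); ) {
            // Work on code points so that surrogate pairs stay together
            int cp = input.codePointAt(i);
            if (isValidXmlChar(cp)) {
                sb.appendCodePoint(cp);
            }
            i += Character.charCount(cp);
        }
        return sb.toString();
    }
}
```

It could be chained with the cleaner, for example `XmlCharFilter.stripInvalidXmlChars( WordCleaner.runWordCleaner( text ) )`. Note that curly quotes such as U+2018 are legal XML characters in their own right, so if they alone break the document, it may be worth checking the declared encoding as well.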

tim_yates
I made a few changes for this to make it a little more groovy, but all in all this was perfect. Thank you! – Howes Oct 11 '12 at 15:05