When I paste text copied from Microsoft Word into a text field, junk characters get inserted. The parameters look fine when they are posted from the JSP page, but when I read the parameter in Java it has turned into junk. I used the following code to try to remove the junk before inserting into the database. I am using a MySQL database (JBoss 5.1 GA server).

String outputEncoding = "UTF-8";

Charset charsetOutput = Charset.forName(outputEncoding);
CharsetEncoder encoder = charsetOutput.newEncoder();
byte[] bufferToConvert = userText.getBytes();
CharsetDecoder decoder =  (CharsetDecoder) charsetOutput.newDecoder();
try {
    CharBuffer cbuf = decoder.decode(ByteBuffer.wrap(bufferToConvert));
    ByteBuffer bbuf = encoder.encode(CharBuffer.wrap(cbuf));
    userText = decoder.decode(bbuf).toString();
} catch (CharacterCodingException e) {
    e.printStackTrace();
}

But I am still getting junk characters for single quotes ('') and double quotes (""). I need the string in UTF-8. Can anyone suggest where I may be going wrong?

Example: input: "esgh”, output: â??esghâ??, wanted output: "esgh”.

  • Can you give a few examples of input and wanted output? – Keppil Jul 24 '12 at 10:03
  • I have given one example, but it happens for single quotes as well. –  Jul 24 '12 at 10:09
  • Couldn't you just filter by ASCII values? Just take everything greater than 31 and less than 128. – Rosdi Kasim Jul 24 '12 at 10:10
  • Your `inputDecoder` variable is not used in your code sample. Is this intentional or a mistake? I would have thought you would obtain a `Charset` instance for the input character encoding and use that instead of the decoder you obtain from the output character set. – Duncan Jones Jul 24 '12 at 10:11
  • @DuncanJones It does not make any difference. I was trying something else and posted that line by mistake. –  Jul 24 '12 at 10:13
  • Are you sure you're reading it back as UTF8? Often the issue is you make sure it's correctly encoded, then just read it out of the database expecting it to be correct, but you've somehow read it as some other encoding (I'm not sure MySQL tells the client what the encoding is, but a lot of software seems to display it incorrectly even though it's stored correctly in UTF8). – Vala Jul 24 '12 at 10:13
  • @Thor84no Yes, I am sure. In any case, it is stored as junk in the database. –  Jul 24 '12 at 10:15
  • Well, it looks from your output like `"` is converted to `â??`. Can you look at the actual bytes for that, since the Latin printing of it is rather useless? – Vala Jul 24 '12 at 10:24
  • yes. the actual bytes shows -38 0 -98 0 -98 0 for the junk –  Jul 25 '12 at 12:34
  • It's not clear what you mean by "While posting parameters from jsp page it remains fine". How are you determining that it's invalid in Java? Just because you've got problems in the database doesn't mean the value is incorrect in Java. – Jon Skeet Jul 30 '12 at 19:47
  • @JonSkeet I tried posting to PHP and it comes through fine; the problem appears when the parameters are received in Java. –  Aug 01 '12 at 07:14
  • What operating system are you running? Do you know the default character encoding on your system? – erickson Aug 04 '12 at 16:51
  • I'm confident that your problem is that somewhere along the line, a string is encoded to bytes with UTF-8, then decoded incorrectly. However, your question is so unclear that it's impossible to tell where that is happening. In the code above, where did the value in `userText` come from? Is it input from a web browser? If so, what server are you using and was the request a GET or a POST? Or is this happening after the value has been retrieved from your MySQL database? Post the code that you used to find the information that led you to say, "the actual bytes shows -38 0 -98 0 -98 0 for the junk." – erickson Aug 04 '12 at 17:15
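
The kind of corruption described in the last comment can be reproduced in isolation. The following is only a sketch: it assumes the wrong charset is ISO-8859-1, which the thread does not confirm, and the class and method names (`MojibakeDemo`, `garble`, `repair`) are made up for illustration. `StandardCharsets` needs Java 7 or later.

```java
import java.nio.charset.StandardCharsets;

class MojibakeDemo {
    // Encode the text as UTF-8, then decode those bytes with the wrong charset.
    // ISO-8859-1 is an assumption here, chosen only to illustrate the effect.
    static String garble(String text) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.ISO_8859_1);
    }

    // Undo the mistake: recover the original bytes, then decode them as UTF-8.
    static String repair(String garbled) {
        byte[] raw = garbled.getBytes(StandardCharsets.ISO_8859_1);
        return new String(raw, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String leftQuote = "\u201C"; // left curly double quote, as produced by Word
        for (char c : garble(leftQuote).toCharArray()) {
            System.out.printf("U+%04X ", (int) c); // prints U+00E2 U+0080 U+009C
        }
        System.out.println();
        String original = "\u201Cesgh\u201D";
        System.out.println(repair(garble(original)).equals(original)); // prints true
    }
}
```

Each curly quote becomes `â` followed by two control characters, which many consoles and fonts cannot display. Whether that is what happened here depends on where in the request path the wrong decoding takes place.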

4 Answers

You have to swap the encode and decode calls. Also, you are decoding twice for only one encoding step!

You wrote:

CharBuffer cbuf = decoder.decode(ByteBuffer.wrap(bufferToConvert));
ByteBuffer bbuf = encoder.encode(CharBuffer.wrap(cbuf));
userText = decoder.decode(bbuf).toString();

But, obviously, it has to be:

ByteBuffer bbuf = encoder.encode(CharBuffer.wrap(userText));
CharBuffer cbuf = decoder.decode(bbuf);
userText = cbuf.toString();

First, you have to encode your text, then decode the encoded result.

Martijn Courteaux
  • But the encode method is not applicable to a ByteBuffer argument; it gives an error. –  Jul 24 '12 at 10:38
  • Oh, yes, you are right. Swap the buffers as well. Check out my edited answer. – Martijn Courteaux Jul 24 '12 at 10:40
  • Thanks for the effort, but it still does not remove the junk. I am getting the same result as before. –  Jul 24 '12 at 10:53
  • What are you trying to achieve here? If the encoder is the platform default (say, ISO-8859-1) and the decoder is UTF-8, your solution is definitely going to corrupt the text. And if you are lucky, the platform default is UTF-8, and this will do absolutely nothing. – erickson Aug 04 '12 at 16:33
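
The point in the last comment, that encoding and then decoding with the same charset changes nothing, can be checked with a small sketch. The class and method names (`RoundTripDemo`, `roundTrip`) are hypothetical, and the garbled sample string is an assumed example rather than the asker's actual data.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class RoundTripDemo {
    // Encode the text to bytes and decode the bytes again, both with one charset.
    static String roundTrip(String text, Charset charset) {
        try {
            ByteBuffer bytes = charset.newEncoder().encode(CharBuffer.wrap(text));
            return charset.newDecoder().decode(bytes).toString();
        } catch (CharacterCodingException e) {
            // Raised only if the text cannot be represented in the charset.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // A string that is already corrupted (assumed sample: "â" plus two control characters).
        String garbled = "\u00E2\u0080\u009Cesgh";
        String result = roundTrip(garbled, StandardCharsets.UTF_8);
        System.out.println(result.equals(garbled)); // prints true: the junk is still there
    }
}
```

If the string is already wrong by the time it reaches this code, a UTF-8 round trip hands back the same wrong string, which would be consistent with the "same result as before" reported above.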

Text copied from Microsoft Word often contains 'Smart Quotes' (curly quotation marks), which are easily corrupted when the encoding and decoding charsets do not match. Try using Windows-1252 as the source encoding. Also, I would suggest using String#getBytes(String) and String#String(byte[],Charset) for the conversions; there is no need to mess with buffers at this level.
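
A rough sketch of what such a conversion might look like, assuming the raw bytes really are Windows-1252 (the thread does not confirm this). The class and method names (`Cp1252ToUtf8`, `decode`, `transcode`) are invented for the example.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class Cp1252ToUtf8 {
    private static final Charset WINDOWS_1252 = Charset.forName("windows-1252");

    // Interpret raw bytes as Windows-1252 text (the assumed source encoding).
    static String decode(byte[] cp1252Bytes) {
        return new String(cp1252Bytes, WINDOWS_1252);
    }

    // Convert Windows-1252 bytes to UTF-8 bytes by going through a String.
    static byte[] transcode(byte[] cp1252Bytes) {
        return decode(cp1252Bytes).getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // 0x93 and 0x94 are the curly double quotes in Windows-1252.
        byte[] input = {(byte) 0x93, 0x65, 0x73, 0x67, 0x68, (byte) 0x94};
        System.out.println(decode(input).equals("\u201Cesgh\u201D")); // prints true
        System.out.println(transcode(input).length); // prints 10 (each quote takes 3 bytes in UTF-8)
    }
}
```

This only helps at a point where the original bytes are still available; once they have been decoded with the wrong charset into a `String`, the conversion has to be reversed first.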

Tassos Bassoukos

The answer by Martijn Courteaux should give you the expected output. Also check the server's CHARACTER SET and COLLATION settings, and set them to UTF-8.

I hope it will work.

JDGuide

Please check whether the web/application server is sending the correct data.

Which web/application server are you using?

Are you using a simple text field or some other kind of input?

Satish Pandey