
I have a question related to the encoding of Japanese text.

Let us say I have a system which comprises a JVM and a database. It serves pages through an application server to client users on the Internet Explorer web browser. The JVM and database use UTF-8 throughout. There are a number of text areas, some but not all of which make use of TinyMCE.

I am concerned about the situation in which a Japanese user pastes some text which is not encoded in UTF-8. Is this likely to cause problems? If the user pastes text encoded in Shift-JIS, can it be expected to work? Early tests have not revealed any problems; however, I have no knowledge of the language and am concerned that special cases may exist.

Steven Im
daqpan

1 Answer


Based on http://msdn.microsoft.com/en-us/library/windows/desktop/ff729168%28v=vs.85%29.aspx:

Unicode and non-Unicode text have the clipboard formats CF_UNICODETEXT and CF_TEXT respectively, and the operating system converts transparently between them, based on which format the target application requests.
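To make that conversion concrete, here is a minimal Python sketch of what happens conceptually when text moves between the two formats. It is an illustration, not the Windows implementation: it assumes a Japanese-locale system whose non-Unicode (ANSI) code page is 932, Microsoft's Shift-JIS variant, which Python exposes as the `cp932` codec. The two function names are hypothetical, and the terminating null that real clipboard data carries is left out.

```python
# Conceptual sketch only: models the CF_TEXT <-> CF_UNICODETEXT conversion
# with Python codecs. Assumes the ANSI code page is 932 (codec "cp932").

def ansi_to_unicode_clipboard(ansi_bytes: bytes, code_page: str = "cp932") -> bytes:
    """Turn CF_TEXT-style bytes into CF_UNICODETEXT-style UTF-16LE bytes."""
    return ansi_bytes.decode(code_page).encode("utf-16-le")


def unicode_to_ansi_clipboard(utf16_bytes: bytes, code_page: str = "cp932") -> bytes:
    """Turn CF_UNICODETEXT-style UTF-16LE bytes into CF_TEXT-style bytes."""
    return utf16_bytes.decode("utf-16-le").encode(code_page)


text = "日本語のテキスト"

# What a non-Unicode application would place on the clipboard.
ansi = text.encode("cp932")

# What a Unicode application (such as a browser) would read back.
wide = ansi_to_unicode_clipboard(ansi)

# The characters are identical; only the byte representation differs.
round_trip_ok = wide.decode("utf-16-le") == text
```

As long as every character exists in both encodings, the round trip is lossless, which is why pasting Shift-JIS text into a Unicode-aware application normally just works.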

If you are concerned about Japanese users, but think you can't test the issue because you don't speak Japanese, you're wrong.

First of all, you can set the entire operating system to Japanese regional settings, which will set the non-Unicode encoding to Shift-JIS systemwide.

Second, you can start a single application with Japanese regional settings using AppLocale.

Third, you can test any other non-Unicode encoding, like Windows-1250/1251/1252, since the nature of conversion is practically identical.
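The same kind of check can be tried outside Windows. The sketch below assumes Python's built-in `cp1250`, `cp1251`, `cp1252` and `cp932` codecs as stand-ins for the corresponding Windows code pages; `survives_code_page` is a hypothetical helper name. It reports whether a string can pass through a given non-Unicode encoding without loss, which is the property that matters when text crosses a non-Unicode application.

```python
def survives_code_page(text: str, code_page: str) -> bool:
    """Return True if text can be encoded in code_page and decoded back unchanged."""
    try:
        return text.encode(code_page).decode(code_page) == text
    except UnicodeEncodeError:
        # At least one character has no representation in this code page.
        return False


samples = {
    "polish": "Zażółć gęślą jaźń",
    "russian": "Привет",
    "japanese": "日本語",
}

results = {
    "polish_in_cp1250": survives_code_page(samples["polish"], "cp1250"),
    "polish_in_cp1252": survives_code_page(samples["polish"], "cp1252"),
    "russian_in_cp1251": survives_code_page(samples["russian"], "cp1251"),
    "japanese_in_cp932": survives_code_page(samples["japanese"], "cp932"),
    "japanese_in_cp1252": survives_code_page(samples["japanese"], "cp1252"),
}
```

Text survives only when the code page contains every character in it, so the failure mode to look for in testing is text passing through an application whose code page does not match the script.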

Karol S
  • Thanks, that is really useful. I've understood from what you've written that if I had an internet browser (IE9) with one tab on a Shift-JIS site and one on a UTF-8 site, that the operating system should handle the cut and paste from the Shift-JIS tab to the UTF-8 one seamlessly. Is that correct? – daqpan Jul 14 '14 at 13:34
  • And by handle it I mean perform the conversion between the encodings. – daqpan Jul 14 '14 at 13:36
  • Web browsers use Unicode everywhere; better try some non-Unicode aware application, like old Notepad, old WinRAR, etc. – Karol S Jul 14 '14 at 13:38
  • Thanks for that great tip. Just to clarify what you mean by 'web browsers use Unicode everywhere' - apologies I'm not an encoding expert - are you saying that yes you could paste from a shift-jis browser tab to a utf-8 browser tab and the text you paste in would be in utf-8? My understanding is that shift-jis doesn't start off in unicode. The picture I've developed in my mind from what you've said is that the encoding in the tags simply instructs the browser how to render the character but it translates it into an actual unicode character, which allows transparent pasting to utf-8. – daqpan Jul 14 '14 at 15:35
  • There are no "shift-jis tabs" and "utf-8 tabs". A web browser, after parsing the HTML, stores all the text internally in some Unicode-capable encoding, usually UTF-16. So yes, you're guessing right. Encoding/decoding happens only when you load a page or send a request. – Karol S Jul 14 '14 at 19:49
  • Thanks, your help has been invaluable. – daqpan Jul 15 '14 at 09:08
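The point made in the comments, that a browser decodes page bytes into Unicode when loading and encodes again only when sending a request, can be sketched roughly as follows. This is a simplified model, not browser code: the page contents, the variable names and the "Pasted: " prefix are made up for illustration, and Python's `str` stands in for the browser's internal Unicode representation.

```python
# Two pages contain the same Japanese word but are served in different encodings.
sjis_page_bytes = "日本語".encode("shift_jis")
utf8_page_bytes = "日本語".encode("utf-8")

# On load, the browser decodes each page using its declared encoding.
# After this step both are plain Unicode strings with no encoding attached.
text_from_sjis_page = sjis_page_bytes.decode("shift_jis")
text_from_utf8_page = utf8_page_bytes.decode("utf-8")

# Copy and paste operates on the decoded text, so it is an ordinary
# string operation regardless of how the source page was encoded.
form_value = "Pasted: " + text_from_sjis_page

# Only when the form is submitted is the text encoded again, here as UTF-8
# because that is what the receiving page in this example declares.
request_body = form_value.encode("utf-8")
```

On the server side, the bytes that arrive are therefore UTF-8 regardless of the encoding of the page the user copied from.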