Questions tagged [cjk]

CJK stands for Chinese, Japanese and Korean and is used to label issues common to these East Asian languages and their large character repertoires.

These East Asian languages are covered by various character sets and encodings, including:

  • Big5
  • EUC-JP
  • EUC-KR
  • Shift-JIS
  • GB2312
  • GB18030
  • ISO-2022-JP
  • Unicode
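
As a rough illustration of how these encodings differ, the sketch below encodes one ideograph with Python's standard codecs and reports the byte count for each. The codec names are Python's aliases for the encodings listed above (with UTF-8 standing in for Unicode), and `CODECS` and `encoded_sizes` are hypothetical names chosen for this example.

```python
# Python codec aliases for the encodings in the list above (UTF-8 stands in for Unicode).
CODECS = ["big5", "euc_jp", "euc_kr", "shift_jis",
          "gb2312", "gb18030", "iso2022_jp", "utf-8"]


def encoded_sizes(text):
    """Return a dict mapping each codec name to the encoded length of text in bytes."""
    sizes = {}
    for codec in CODECS:
        data = text.encode(codec)
        # Sanity check: decoding with the same codec must give the text back.
        if data.decode(codec) != text:
            raise ValueError(f"{codec} did not round-trip {text!r}")
        sizes[codec] = len(data)
    return sizes


if __name__ == "__main__":
    # U+4E2D is an ideograph that is present in all of the legacy sets above.
    for codec, size in encoded_sizes("\u4e2d").items():
        print(f"{codec:>10}: {size} bytes")
```

The same character takes a different number of bytes depending on the encoding, which is why a byte stream cannot be interpreted without knowing which encoding produced it.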
1096 questions
23 votes, 9 answers

How to do a Python split() on languages (like Chinese) that don't use whitespace as word separator?

I want to split a sentence into a list of words. For English and European languages this is easy, just use split() >>> "This is a sentence.".split() ['This', 'is', 'a', 'sentence.'] But I also need to deal with sentences in languages such as…
Continuation
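
One common starting point for this kind of problem is greedy longest-match segmentation against a word list. The sketch below is only an illustration under that assumption: `segment`, `vocabulary`, and `max_len` are hypothetical names, the tiny word list is made up for the example, and real segmenters rely on much larger dictionaries or statistical models.

```python
def segment(text, vocabulary, max_len=4):
    """Split text into words by greedy forward longest match.

    At each position, try the longest candidate first and shrink it until
    it is found in vocabulary; a single character is always accepted as a
    fallback so the loop always advances.
    """
    words = []
    i = 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in vocabulary:
                words.append(candidate)
                i += size
                break
    return words


if __name__ == "__main__":
    # Hypothetical miniature word list, for illustration only.
    vocabulary = {"你们", "你", "我", "哪里"}
    print(segment("你们要去哪里", vocabulary))
```

Greedy matching is simple but can pick the wrong boundary when words overlap, so it is best treated as a baseline rather than a complete solution.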
22 votes, 7 answers

Convert numbered pinyin to pinyin with tone marks

Are there any scripts, libraries, or programs using Python, or Bash tools (e.g. awk, perl, sed) which can correctly convert numbered pinyin (e.g. dian4 nao3) to UTF-8 pinyin with tone marks (e.g. diàn nǎo)? I have found the following examples, but…
Village
22 votes, 2 answers

Are all Kanji characters in UTF-8 3 bytes long?

Can someone please confirm that all Kanji characters in Chinese are 3 bytes long in UTF-8?
TopCoder
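
A quick way to check a claim like this is to measure the encoded length directly. The sketch below, with the hypothetical helper `utf8_len`, compares an ideograph from the Basic Multilingual Plane with one from CJK Unified Ideographs Extension B, which lies outside it.

```python
def utf8_len(ch):
    """Return the number of bytes ch occupies when encoded as UTF-8."""
    return len(ch.encode("utf-8"))


# U+6F22 lies in the CJK Unified Ideographs block (U+4E00..U+9FFF).
bmp_ideograph = "\u6f22"
# U+20000 is the first code point of CJK Unified Ideographs Extension B.
supplementary_ideograph = "\U00020000"

if __name__ == "__main__":
    print(utf8_len(bmp_ideograph))            # 3
    print(utf8_len(supplementary_ideograph))  # 4
```

Ideographs in the Basic Multilingual Plane take 3 bytes in UTF-8, while those in the supplementary planes take 4, so a blanket "3 bytes" assumption only holds for the former.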
22 votes, 5 answers

How do I add a font in gVim on Windows?

I wanted to add a UTF-8 font in gVim but I could not find out how to do this. I tried to follow the steps in this manual but it still did not work. http://www.inter-locale.com/whitepaper/learn/learn_to_type.html (Vim section, halfway down the page) Can…
user18383
21 votes, 9 answers

How does a file with Chinese characters know how many bytes to use per character?

I have read Joel's article "The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)" but still don't understand all the details. An example will illustrate my issues. Look at this…
Petras
21 votes, 7 answers

How to convert Chinese characters to Pinyin

For sorting Chinese language text, I want to convert Chinese characters to Pinyin, properly separating each Chinese character and grouping successive characters together. Can you please help me in this task by providing the logic or source code for…
Ashish Yadav
20 votes, 3 answers

Encoding mail subject (SMTP) in Python with non-ASCII characters

I am using the Python module MimeWriter to construct a message and smtplib to send it. The constructed message is: file msg.txt: ----------------------- Content-Type: multipart/mixed; from: me to: me@abc.com subject: 主題 Content-Type:…
Rakesh
20 votes, 2 answers

Regular Expression for Japanese characters

I am doing internationalization in Struts. I want to write JavaScript validation for Japanese and English users. I know the regular expression for English but not for Japanese. Is it possible to write one regular expression for both sets of users…
Nilesh Shukla
19 votes, 12 answers

What is the fastest way to delete the lines in a file which have no match in a second file?

I have two files, wordlist.txt and text.txt. The first file, wordlist.txt, contains a huge list of words in Chinese, Japanese, and Korean, e.g.: 你 你们 我 The second file, text.txt, contains long passages, e.g.: 你们要去哪里? 卡拉OK好不好? I want to create a…
Village
19 votes, 4 answers

Prevent/workaround browser converting '\n' between lines into space (for Chinese characters)

Converting newline into space makes sense for English, for example, the following HTML:

This is a sentence.

We get the following after converting the newline into space in the browser: This is a sentence. This is good for English, but not…
cyfdecyf
19 votes, 6 answers

Conversion from Simplified to Traditional Chinese

If a website is localized/internationalized with a Simplified Chinese translation... Is it possible to reliably automatically convert the text to Traditional Chinese in a high quality way? If so, is it going to be extremely high quality or just a…
philfreo
19 votes, 6 answers

How to print Chinese words in my code using Python

This is my code: print '哈哈'.decode('gb2312').encode('utf-8') ...and it prints: SyntaxError: Non-ASCII character '\xe5' in file D:\zjm_code\a.py on line 2, but no encoding declared; see http://www.python.org/peps/pep-0263.html for details How do I…
zjm1126
18 votes, 4 answers

How can I detect certain Unicode characters in a string in Ruby?

Given a string in Ruby 1.8.7 (without the awesome Oniguruma regular expression engine that supports Unicode properties with \p{}), I would like to be able to determine if the string contains one or more Chinese, Japanese, or Korean characters;…
Josh Glover
18 votes, 2 answers

Word break in languages without spaces between words (e.g., Asian)?

I'd like to make MySQL full text search work with Japanese and Chinese text, as well as any other language. The problem is that these languages and probably others do not normally have white space between words. Search is not useful when you must…
Joe Langeway
18 votes, 1 answer

Drawing multilingual text using PIL

I'm having trouble drawing multilingual text using PIL. Let's say I want to draw text - "ひらがな - Hiragana, 히라가나". But PIL's ImageDraw.text() function takes only one font at a time, so I cannot draw this text correctly, because it requires English,…
redism