Questions tagged [cjk]

CJK stands for Chinese, Japanese and Korean and is used to label issues common to these East Asian languages and their large character repertoires.

These East Asian languages are covered by a range of character sets and encodings, including:

  • Big5
  • EUC-JP
  • EUC-KR
  • Shift-JIS
  • GB2312
  • GB18030
  • ISO-2022-JP
  • Unicode
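As a rough illustration of how these encodings differ in practice, the sketch below encodes the same text with Python's built-in codecs and reports the byte length for each. The sample strings and the helper name `encoded_sizes` are assumptions made for this example only, and UTF-8 stands in for Unicode, which is a character set with several encoding forms.

```python
# Illustrative sketch: the labels follow the list above, the codec names
# are Python's standard codec aliases, and the helper name is hypothetical.
ENCODINGS = {
    "Big5": "big5",
    "EUC-JP": "euc_jp",
    "EUC-KR": "euc_kr",
    "Shift-JIS": "shift_jis",
    "GB2312": "gb2312",
    "GB18030": "gb18030",
    "ISO-2022-JP": "iso2022_jp",
    "Unicode (UTF-8)": "utf_8",
}


def encoded_sizes(text):
    """Return {label: byte length}, or None where the encoding cannot represent text."""
    sizes = {}
    for label, codec in ENCODINGS.items():
        try:
            sizes[label] = len(text.encode(codec))
        except UnicodeEncodeError:
            # The legacy character set has no code point for some character.
            sizes[label] = None
    return sizes


if __name__ == "__main__":
    for label, size in encoded_sizes("日本語").items():
        print(f"{label}: {size if size is not None else 'not representable'}")
```

A legacy encoding can only represent the characters in its own repertoire, so a Korean string, for instance, cannot be encoded as Shift-JIS, while UTF-8 and GB18030 cover all of Unicode.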
1096 questions
7
votes
3 answers

How do I format Chinese characters so they fit the columns?

I am trying to print some information in a column-oriented way. Everything works well for Latin characters, but when Chinese characters are printed, the columns stop being aligned. Let's consider an example: var latinPresentation1 = "some…
artsch
7
votes
2 answers

Korean, Thai and Indonesian POS tagger

Can someone recommend an open-source POS tagger for Korean, Indonesian, Thai and Vietnamese that I can use to tag the corpus data I currently have (e.g. something like the stanford-postagger)? If you are a dev and care to share and let me test out the POS…
alvas
7
votes
2 answers

Word wrap algorithms for Japanese

In a recent web application I built, I was pleasantly surprised when one of our users decided to use it to create something entirely in Japanese. However, the text was wrapped strangely and awkwardly. Apparently browsers don't cope with wrapping…
Breton
7
votes
7 answers

Japanese ASCII Code

Where can I get a list of ASCII codes corresponding to Japanese kanji, hiragana and katakana characters? I am writing a Java function and some JavaScript that determine whether a character is Japanese. What is its range in the ASCII code?
cedric
7
votes
2 answers

Understanding Python Unicode and Linux terminal

I have a Python script that writes some strings with UTF-8 encoding. In my script I mainly use the str() function to cast to string. It looks like this: mystring="this is unicode string:"+japanesevalues[1] #japanesevalues is a list of unicode…
Cesc
7
votes
3 answers

Detecting CJK characters in a string (C#)

I am using iTextSharp to generate a series of PDFs, using Open Sans as the default font. On occasion, names are inserted into the content of the PDFs. However, my issue is that some of the names I need to insert contain CJK characters (stored in…
user1961026
7
votes
2 answers

Manipulating utf8mb4 data from MySQL with PHP

This is probably something simple. I swear I've been looking online for the answer and haven't found it. Since my particular case is a little atypical I finally decided to ask here. I have a few tables in MySQL that I'm using for a Chinese language…
Yhilan
7
votes
3 answers

Get the number of bytes needed for a Unicode string

I have a Korean string encoded as Unicode like u'정정'. How do I know how many bytes are needed to represent this string? I need to know the exact byte count since I'm using the string for iOS push notification and it has a limit on the size of the…
jasondinh
7
votes
2 answers

How to get the length of Japanese characters in JavaScript?

I have an ASP Classic page with SHIFT_JIS charset. The meta tag under the page's head section is like this: My page has a text box (txtName) that should only allow 200…
mark uy
7
votes
3 answers

n-gram name analysis in non-English languages (CJK, etc)

I'm working on deduping a database of people. For a first pass, I'm following a basic 2-step process to avoid an O(n^2) operation over the whole database, as described in the literature. First, I "block": iterate over the whole dataset, and bin each…
Matt Luongo
6
votes
2 answers

Allowing Simplified Chinese Input

The company I work for is bidding on a project that will require our eCommerce solution to accept Simplified Chinese input. After doing a bit of research, it seems that ASP.NET makes globalization configuration easy:
James Hill
6
votes
1 answer

Perl regex find character from arbitrary set

I have a file with Korean and Chinese characters. I want to find pairs where parenthetical statements are used to give the hanja for a Korean word, like this: 한문 (漢文) The search would look something like this: /[korean characters] \([chinese…
Nate Glenn
6
votes
1 answer

Faker Python generating Chinese/pinyin names

I am trying to generate random Chinese names using Faker (Python), but it generates the names in Chinese characters instead of pinyin. I found this: and it shows that it generates them in pinyin, while when I try the same code, it gives me only…
Armonia
6
votes
2 answers

Converting zenkaku characters to hankaku and vice-versa in C#

As it says in the header line, I want to convert zenkaku characters to hankaku ones and vice versa in C#, but can't figure out how to do it. So, say "ラーメン" to "ﾗｰﾒﾝ" and the other way around. Would it be possible to write this in a method which…
yu_ominae
6
votes
2 answers

How to make beautiful line breaks in Japanese?

I have a website in English and Japanese. English is displayed perfectly, but there are problems with hyphenation in Japanese. Sometimes 1-2 hanging characters are left on a new line. I want to manage the hyphenation and put it where I need to. I split…
VoidArray