
In a web application I built recently, I was pleasantly surprised when one of our users used it to create something entirely in Japanese. However, the text wrapped strangely and awkwardly. Apparently browsers don't cope well with wrapping Japanese text, probably because it contains few spaces. One might assume that each character forms a whole word and that a line can break anywhere, but that isn't a safe assumption: some words are made up of several characters, and some character groups shouldn't be split across lines.

Googling around hasn't helped me understand the problem any better. It seems to me that one would need a dictionary of unbreakable patterns and could assume that everywhere else is safe to break. But I fear I don't know enough about Japanese to know all the words, which, as I understand from my searching, are quite complicated.

How would you approach this problem? Are there any libraries or algorithms you are aware of that already exist that deal with this in a satisfactory way?

dda
Breton
  • exact duplicate http://stackoverflow.com/questions/1605353/how-does-one-word-break-languages-without-spaces-between-words-like-asian-langua – Breton Jan 19 '10 at 00:56
  • I think you can't word wrap Japanese without understanding the words so what you'll need at a minimum is a Japanese dictionary. I couldn't tell you how hard that would be though or if there'd be any ambiguity (meaning the correct word depends on context, which will complicate it greatly). – cletus Jan 19 '10 at 00:56
  • not really duplicate - that question is about breaking text into words for the purpose of indexing. That's a hard problem. Fortunately, it can be largely ignored when wrapping for layout. – Michael Borgwardt Jan 19 '10 at 01:00

2 Answers

12

Japanese word wrap rules are called kinsoku shori and are surprisingly simple. They're actually mostly concerned with punctuation characters and do not try to keep words unbroken at all.

I just checked with a Japanese novel and indeed, both words in the syllabic kana script and those consisting of multiple Chinese ideograms are wrapped mid-word with impunity.
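
To make the idea concrete, here is a minimal sketch of punctuation-based wrapping in Python. It is only an illustration under stated assumptions: the two character sets are small, partial samples chosen for this example (real kinsoku rule sets are longer and vary by style guide), `wrap_kinsoku` and `is_legal_break` are hypothetical names, and the width is counted in characters rather than rendered glyph widths. When a break would put a forbidden character at the start or end of a line, the break is simply moved earlier.

```python
# Characters that should not begin a line (closing brackets, punctuation).
# Partial, illustrative set only.
NO_LINE_START = set("、。，．：；？！)）]］｝」』】〕〉》")
# Characters that should not end a line (opening brackets).
# Partial, illustrative set only.
NO_LINE_END = set("(（[［｛「『【〔〈《")


def is_legal_break(text: str, i: int) -> bool:
    """True if a line may break between text[i - 1] and text[i]."""
    return text[i] not in NO_LINE_START and text[i - 1] not in NO_LINE_END


def wrap_kinsoku(text: str, width: int) -> list[str]:
    """Wrap text to at most `width` characters per line, breaking anywhere
    except where the punctuation rules above forbid it."""
    if width < 1:
        raise ValueError("width must be at least 1")
    lines = []
    start = 0
    while len(text) - start > width:
        end = start + width  # tentative break position
        # Walk the break position backwards until it is legal.
        for candidate in range(end, start, -1):
            if is_legal_break(text, candidate):
                end = candidate
                break
        # If no legal position exists, `end` stays at the hard limit.
        lines.append(text[start:end])
        start = end
    if start < len(text):
        lines.append(text[start:])
    return lines


if __name__ == "__main__":
    for line in wrap_kinsoku("彼は「こんにちは」と言った。", 5):
        print(line)
```

Note that words themselves are split freely; only the characters around the break are inspected, which matches the observation above that mid-word breaks are acceptable.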

Michael Borgwardt
  • I just tried that on Yahoo Japan. It seems that Firefox implements kinsoku shori. I could not get a line to start with a closing bracket (that is all I checked). With Safari, I could. – Thilo Jan 19 '10 at 01:06
  • As per the comment from @Michael, I found that the wrapping rules are different for Japanese. I am facing an issue wrapping Japanese content while rendering HTML to PDF. Is there any way to wrap Japanese using CSS or some other means? – lambypie Jan 08 '14 at 11:50
  • Just to clarify, @Michael is not saying that line breaking is a free-for-all. There are specific rules for how to perform line breaks, and the cases in which line breaks are not allowed. The Wikipedia article that he links to is very helpful. – mercurytw Dec 02 '14 at 19:37
0

The following projects are useful for Japanese word wrapping (or word breaking, from another point of view).

mikan takes a regex-based approach, while budou uses natural language processing.
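
As a rough illustration of what a regex-based approach can look like, the Python sketch below groups characters into chunks by script: a run of kanji or katakana plus any hiragana that follows it, a run of hiragana on its own, or a run of anything else. This is not mikan's actual rule set; the pattern, the Unicode ranges, and the name `split_chunks` are assumptions made for this example, and the heuristic will mis-chunk many real sentences.

```python
import re

# Unicode ranges used by this sketch (assumptions, not an exhaustive definition).
KANJI = r"\u4e00-\u9fff\u3005"        # CJK ideographs plus the iteration mark
HIRAGANA = r"\u3041-\u3096"
KATAKANA = r"\u30a1-\u30fa\u30fc"     # katakana plus the long-vowel mark

CHUNK = re.compile(
    rf"[{KANJI}]+[{HIRAGANA}]*"       # kanji run with trailing hiragana
    rf"|[{KATAKANA}]+[{HIRAGANA}]*"   # katakana run with trailing hiragana
    rf"|[{HIRAGANA}]+"                # hiragana run on its own
    rf"|[^{KANJI}{HIRAGANA}{KATAKANA}]+"  # anything else (Latin, digits, punctuation)
)


def split_chunks(text: str) -> list[str]:
    """Split text into chunks that a layout step could keep unbroken."""
    return CHUNK.findall(text)


if __name__ == "__main__":
    print(split_chunks("今日は良い天気です"))
```

Each chunk could then be kept on one line by the layout step, with line breaks allowed only between chunks.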

Youngjae