What challenges does automatic hyphenation face? It seems you could just lay out the text word by word and break whenever the line would exceed the width of the viewport (or whatever container the text is wrapped in). When a word does not fit, you would put as many of its characters as fit on the current line, followed by a hyphen, provided that at least two characters fit and the word is at least four characters long. Words that already contain a hyphen would be skipped, since nothing requires a word to be hyphenated.
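
To make the idea concrete, here is a minimal sketch of that naive approach in Python. The function name `naive_wrap` and the fixed-width character model are assumptions made for illustration only; real layout would measure glyph widths rather than count characters.

```python
def naive_wrap(text, width):
    """Greedy wrap that hyphenates wherever the line runs out of room."""
    lines, line = [], ""
    for word in text.split():
        while True:
            sep = " " if line else ""
            if len(line) + len(sep) + len(word) <= width:
                line += sep + word
                break
            # Characters of the word that fit, leaving one column for "-".
            room = width - len(line) - len(sep) - 1
            can_split = "-" not in word and len(word) >= 4 and room >= 2
            if can_split:
                # Break at an arbitrary position: whatever happens to fit.
                lines.append(line + sep + word[:room] + "-")
                line, word = "", word[room:]
            elif line:
                # Cannot split here: finish the line and retry on a fresh one.
                lines.append(line)
                line = ""
            else:
                # Unsplittable word wider than an empty line: emit it as is.
                lines.append(word)
                break
    if line:
        lines.append(line)
    return lines


print(naive_wrap("this is wrong", 11))
# ['this is wr-', 'ong']
```

The output shows the kind of break this produces: the word is cut wherever the line happens to end, with no regard for how the word is built.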

But I notice that Firefox and IE need a dictionary to be able to hyphenate with CSS's hyphens property. This seems to imply that there are further constraints on where hyphens can be placed.

What kinds of issues are these? Do any exist in the English language or do they only exist in other languages?

Kat
  • Hyphens may *not* be placed arbitrarily in proper text (as opposed to tweets, quick emails, etc.). They should be placed between syllables and in such a way as to not leave too few letters from the word on either line. The dictionaries provide syllable breaks. – mpez0 Jan 01 '15 at 00:08
  • Also, the greedy algorithm may not produce optimal results. For example, if there is a long unbreakable word, you may find that you get more even line lengths if you intentionally break some lines early. – Raymond Chen Jan 01 '15 at 01:52

1 Answer

You have these issues in all languages. As has already been pointed out, you can only place a hyphen where the split leaves meaningful units on both sides. You don't want to split a word like "wr-ong", for example.

In most languages (including English) those units are usually syllables, but not always. The main point is that you cannot pin the break points down with a few simple rules. You would need to take a lot of phonology into account to get a highly accurate result, and the rules vary from language to language.

With this background, I can see why one would take a dictionary instead, and frankly, being a computational linguist myself, this is also what I would probably opt for.
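
A minimal sketch of what the dictionary approach could look like in Python, under some loud assumptions: `BREAKS` is a hypothetical three-word stand-in for a real hyphenation dictionary, its entries are only illustrative, and `split_for_room` is a made-up helper name. The minimum of two letters on either side is likewise just a plausible default.

```python
# Hypothetical miniature dictionary: "-" marks the permitted break points.
# Real break points should come from a proper hyphenation dictionary.
BREAKS = {
    "hyphenation": "hy-phen-a-tion",
    "syllable": "syl-la-ble",
    "wrong": "wrong",  # one syllable, so no permitted break
}


def split_for_room(word, room, min_left=2, min_right=2):
    """Return (head, tail) so that head + '-' fits in `room`, else None."""
    pattern = BREAKS.get(word.lower())
    if pattern is None:
        return None  # unknown word: leave it whole rather than guess
    best = None
    cut = 0
    # Walk the permitted break points from left to right.
    for part in pattern.split("-")[:-1]:
        cut += len(part)
        if cut + 1 > room:  # head plus the hyphen no longer fits
            break
        if cut >= min_left and len(word) - cut >= min_right:
            best = (word[:cut], word[cut:])
    return best


print(split_for_room("hyphenation", 7))  # ('hyphen', 'ation')
print(split_for_room("wrong", 5))        # None
```

The function picks the rightmost permitted break that still fits, and returns `None` when no permitted break exists, so "wrong" is moved to the next line whole instead of being split.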

If you DO want to go for an automatic solution, I would recommend reading up on how English words divide into syllables, known as syllabification. You might want to start with this article on Wikipedia:

Wikipedia - Syllabification

ling_jan