
This question follows a previous question but is different. Synopse's Delphi hyphenation is very fast and is built on OpenOffice's libhnj library, which uses TeX hyphenation patterns.

A simple test:

If I input 'pronunciation', the Synopse hyphenation outputs 'pro=nun=ci=ation' (4 syllables), not 'pro=nun=ci=a=tion' (5 syllables).
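The syllable count implied by a hyphenation string is simply the number of separators plus one. A minimal Python sketch (the function name is my own for illustration, not part of the Synopse API):

```python
def syllables_from_hyphenation(hyphenated: str) -> int:
    """Number of syllables implied by a '='-separated hyphenation string."""
    return hyphenated.count("=") + 1

print(syllables_from_hyphenation("pro=nun=ci=ation"))   # 4
print(syllables_from_hyphenation("pro=nun=ci=a=tion"))  # 5
```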

I read two papers (here and here) about the use of the TeX hyphenation algorithm for syllabification. The authors reported about 95% accuracy in syllabification. I tested the Synopse hyphenation only for counting syllables, against the CMU Pronouncing Dictionary, and got only about 53% accuracy.

Why is the result significantly different?

Here is my method in more detail.

I parse the CMU Pronouncing Dictionary to compute the number of syllables for each word. The CMU dictionary looks like this:

PRONOUNS  P R OW1 N AW0 N Z
PRONOVOST  P R OW0 N OW1 V OW0 S T
PRONTO  P R AA1 N T OW0
PRONUNCIATION  P R OW0 N AH2 N S IY0 EY1 SH AH0 N
PRONUNCIATION(1)  P R AH0 N AH2 N S IY0 EY1 SH AH0 N

This gives the following syllable counts:

PRONOUNS=2
PRONOVOST=3
PRONTO=2
PRONUNCIATION(1)=5 // will be ignored
PRONUNCIATION=5   // use this one

Words with parentheses are ignored when comparing against the Synopse hyphenation library; they are alternative or secondary pronunciations (variants).
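In CMU notation, every vowel phoneme carries a stress digit (0, 1, or 2), so the syllable count of a word is the number of phonemes ending in a digit. A minimal Python sketch of the parsing step described above (the function and variable names are my own, and the filtering of non-word lines is an assumption about the method):

```python
def cmu_syllable_counts(lines):
    """Map each CMU head-word to its syllable count.

    Vowel phonemes in CMU notation end with a stress digit (0, 1, 2),
    so counting them gives the syllable count. Entries whose head-word
    contains '(' are alternative pronunciations and are skipped, as are
    punctuation entries.
    """
    counts = {}
    for line in lines:
        parts = line.split()
        if len(parts) < 2:
            continue
        word, phonemes = parts[0], parts[1:]
        if "(" in word or not word[0].isalpha():
            continue  # skip variants like PRONUNCIATION(1) and punctuation lines
        counts[word] = sum(1 for p in phonemes if p[-1].isdigit())
    return counts

sample = [
    "PRONOUNS  P R OW1 N AW0 N Z",
    "PRONUNCIATION  P R OW0 N AH2 N S IY0 EY1 SH AH0 N",
    "PRONUNCIATION(1)  P R AH0 N AH2 N S IY0 EY1 SH AH0 N",
]
print(cmu_syllable_counts(sample))  # {'PRONOUNS': 2, 'PRONUNCIATION': 5}
```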

Similarly, I use the hyphenation library to compute the number of syllables of each word in the CMU dictionary, then compare the two counts to see how many match. Words with different syllable counts are recorded like this:

...

94814 cmu PROMULGATED=4 | PROMULGATED=3 Synopse Hyphenation
94821 cmu PRONGER=2 | PRONGER=1 Synopse Hyphenation
94829 cmu PRONOUNCES=3 | PRONOUNCES=2 Synopse Hyphenation
94833 cmu PRONTO=2 | PRONTO=1 Synopse Hyphenation
94835 cmu PRONUNCIATION=5 | PRONUNCIATION=4 Synopse Hyphenation

...

The CMU dictionary contains 123611 lines in total (excluding lines with parentheses and lines that are not meaningful words, such as punctuation entries like '('). The number of words for which the two syllable counts differ is 57870.

CMU may not be the gold standard for syllable counts. In this test, the agreement is (123611-57870)/123611 = 53.183%. This is significantly different from the accuracy reported by the authors in the papers above. Of course, they used a different database (CELEX) for their tests. Why is the result so different?
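The comparison step can be sketched as follows, assuming the syllable counts from both sources are held in dicts keyed by word (all names here are hypothetical, not from the Synopse library):

```python
def agreement_rate(cmu_counts, hyph_counts):
    """Fraction of shared words whose two syllable counts agree."""
    shared = [w for w in cmu_counts if w in hyph_counts]
    matched = sum(1 for w in shared if cmu_counts[w] == hyph_counts[w])
    return matched / len(shared)

# Sanity check with the totals reported above:
# 123611 comparable words, 57870 mismatches.
print(round((123611 - 57870) / 123611, 5))  # 0.53184
```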

The Synopse hyphenation library is very fast. I want to know whether the discrepancy is due to the pattern file (the hyphenation .dic file, originally from libhnj as used in OpenOffice), or whether the authors of the papers used a different pattern file.

Warren
    There are too many questions here in a single post. SO is designed for a single question at a time, so that a single answer can be made (and accepted). Also, these are questions you should be asking Synopse; Arnaud Bouchez is usually very good at handling Synopse support questions. – Ken White Apr 15 '12 at 04:28
  • @Ken Thanks for correction, now only one question. Other questions will be posted elsewhere or in different posts. – Warren Apr 15 '12 at 04:38
  • @Arnaud Bouchez: Hopefully you come to this post's rescue! – menjaraz Apr 15 '12 at 16:13
  • The algorithm embedded within the package is the standard libhnj library, so there should be no difference from the authors' statement. Please provide some code to reproduce the issue. – Arnaud Bouchez Apr 15 '12 at 18:22
  • @Warren: Congratulation! First class answer from the paper's author, you are lucky. – menjaraz Apr 16 '12 at 19:53

1 Answer


In short, I believe the reason the difference in accuracy is so great between what was reported in our SPIRE 2009 paper and the results being reported here is that we trained the method ourselves instead of using patterns generated through prior training (which, from what I can gather, is what you are doing here).

How we performed training to obtain our patterns is described briefly on the third page of our paper (pg.176) and in more detail in Section 4.3 of my thesis which you can find here: http://web.cs.dal.ca/~adsett/Adsett_SyllAlgs_2008.pdf

menjaraz
  • @ConnieAdsett Welcome to Stack Overflow. Thank you so much for your participation and discussion, and many thanks for the answer and explanation. We are studying the text you advised. We hope you will come here regularly and give suggestions. – Warren Apr 17 '12 at 01:34
  • @menjaraz I wrote and invited Connie. Connie kindly and immediately responded. – Warren Apr 17 '12 at 01:38
  • @ConnieAdsett I have finished (briefly) reading your paper. The difference appears to be in the pattern file used. According to your paper (p.51), the pattern used for your conclusions is wholly machine-generated by Patgen. My understanding is that the generated pattern file is itself based on a prototype file containing models like ta-ble, tab-u-late... You train on this prototype file and obtain a pattern file including 2bu, a2b... So hyp_en_US.dic from Synopse's distribution is a different pattern file. Can you direct us to that prototype file resource for training? – Warren Apr 17 '12 at 05:50
  • @ConnieAdsett Ned Batchelder as mentioned in your paper creates a hyphenation lib, that is Python version. Ned is also an active expert at Stackoverflow. – Warren Apr 17 '12 at 05:52
  • @Warren You are correct in your understanding of what we did to obtain the patterns. I would like to emphasize that, in the papers you are reading, we are working on SYLLABIFICATION and not HYPHENATION - there are many similarities but there are also differences (described in brief in pg.1 of my thesis). If you intend to hyphenate, you may not want to use our patterns because they are for syllabification. – Connie Adsett Apr 30 '12 at 14:02
  • @Warren (cont'd) That said... In my thesis, we tested a number of parameter sets to generate the patterns with Patgen. Those parameter sets are on Table 4.9. The results of these sets and their analysis are in Section 5.1.2 (including the avg # of patterns generated). For work on English syllabification, we used the CELEX database (http://www.ldc.upenn.edu/Catalog/catalogEntry.jsp?catalogId=LDC96L14). Unfortunately, because this must be purchased, I am not free to give you the pattern sets that we generated from it. You must obtain CELEX for yourself and generate your own patterns – Connie Adsett Apr 30 '12 at 14:03
  • @Warren (cont'd) However, we also did some work on hyphenation that hasn't yet been published. In this work, we used the Moby Hyphenator resource (http://icon.shef.ac.uk/Moby/). We separated common words from proper nouns and, for common words, the best parameter set was the one found in Antos (2007, Table 7). We trained and tested using 10-fold cross-validation and obtained approximately 83% word accuracy. There are roughly 21,000 patterns in each of the 10 training sets. I can provide these pattern sets via e-mail but should also give some warnings. (in next comment) – Connie Adsett Apr 30 '12 at 14:03
  • @Warren (cont'd) - If you are unfamiliar with 10-fold cross-validation (and there is a good explanation here: http://en.wikipedia.org/wiki/Cross-validation_%28statistics%29), you should know that it would not be correct to combine these pattern sets. If you do, you will be combining multiple results from the same words. - Because this work has not yet been published, I'm hesitant to provide you with the exact word lists from which 10 folds of common words were generated. It would be better for you to take Moby and to generate your own pattern sets yourselves. – Connie Adsett Apr 30 '12 at 14:03