How do you customise text segmentation to not break between a digraph?

Question

Works:

#!/usr/bin/env python3
from uniseg.graphemecluster import grapheme_clusters
def albanian_digraph_dh(s, breakables):
    for i, breakable in enumerate(breakables):
        if s.endswith('d', 0, i) and s.startswith('h', i):
            yield 0
        else:
            yield breakable

print(list(grapheme_clusters('dhelpëror', albanian_digraph_dh)))
#['dh', 'e', 'l', 'p', 'ë', 'r', 'o', 'r']

Needs improvement/customisation:

perl -C -Mutf8 -mUnicode::GCString -E'
    say join " ", Unicode::GCString
        ->new("dhelpëror")->as_array
'
#d h e l p ë r o r

perl6 -e'"dhelpëror".comb.say'
#(d h e l p ë r o r)

NB: writing your own segmentation, which is almost guaranteed not to implement UAX #29 correctly, counts as side-stepping the problem.

No, it doesn't. It demonstrates customisation (implemented as a callback), but otherwise runs the correct segmentation algorithm. — daxim, Aug 23 '19 at 09:27
"[uniseg] ... runs the correct segmentation algorithm" A fundamental issue with these libraries is their long term maintenance. Are you sure your uniseg is getting it right? When I looked around a year ago the most up-to-date fork of uniseg I found was using out of date grapheme segmentation rules. (Fwiw, the most promising looking python grapheme segmentation library I found was the then newish [grapheme](https://libraries.io/pypi/grapheme) which, for example, [implemented the Unicode 11 rule changes](https://github.com/alvinlindstam/grapheme/commit/8d7b6cf096738a7f1e3a6e9e470e14f1b5a6810b)). — raiph, Aug 23 '19 at 22:47
I should clarify. When I looked last year it seemed that `grapheme` was the most promising *pure Python* library. (Though, to be clear, it's only claiming alpha status. uniseg is a stage better "on paper" in that it claims beta status but then again it hasn't been updated in years afaict.) Based on my research last year the serious solution in the Python space is [the PyICU project](https://pypi.org/project/PyICU/) which wraps libraries from [the ICU project](http://site.icu-project.org/home). — raiph, Aug 24 '19 at 00:32
Any idea of how that Python library works? It's probably enumerating all possible digraphs in all languages, is that correct? — jjmerelo, Aug 26 '19 at 06:57
@raiph because it's selecting digraphs for Albanian; it might have others. ch is a digraph in Spanish, for instance. — jjmerelo, Aug 27 '19 at 07:23
@raiph as daxim says, writing your own segmentation is sidestepping the problem. There's a Unicode specification for grapheme clusters, which is where all such digraphs by language are defined. I think the Python implementation does that. — jjmerelo, Aug 27 '19 at 08:32
@jjmerelo "how that Python library works?". Here's the "official" [`uniseg` repo's `graphemecluster.py` that contains the `grapheme_clusters` function](https://bitbucket.org/emptypage/uniseg-python/raw/e4077d17d026c36999b89c10081a85b219e1fb7b/uniseg/graphemecluster.py) daxim used. To use tailored (eg locale specific) grapheme clustering one must manually edit `break_table` and/or supply a custom callback as the `tailor` argument to, eg, `grapheme_clusters` as daxim did. It can't enumerate all possible digraphs because they conflict. It's only selecting Albanian cuz daxim manually wrote that. — raiph, Aug 27 '19 at 17:51
@jjmerelo "Unicode specification ... where all such digraphs by language are defined." There's a critically important reason why **the Unicode consortium** does *not* refer to most language/locale specific data as a "specification". ["How can I get Unicode implementations to recognize the digraph more generally?"](https://www.unicode.org/faq/ligature_digraph.html#Dig6) links to the [CLDR](http://cldr.unicode.org/). The CLDR will have Albanian digraph data but it may change from release to release without notice and is *not* [a CLDR specification](http://cldr.unicode.org/index/cldr-spec). — raiph, Aug 27 '19 at 18:13
@jjmerelo "I think the Python implementation [uses the CLDR] where all such digraphs by language are defined." If by "Python implementation" you mean the library daxim used, the pure python `uniseg`, then not really. Someone can *manually edit* the module as explained in my first comment to you above but they must maintain it themselves. The `grapheme` library (which I linked in a comment above) *is* maintained but, afaik, also doesn't bundle CLDR data. Finally, the `PyICU` module (also linked above; it wraps C++ ICU libs) explicitly aims to track CLDR. — raiph, Aug 27 '19 at 18:33
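
The `tailor` callback discussed in the comments above can be generalised from one hard-coded digraph to a per-language list. The following is only a sketch: `make_digraph_tailor`, `split_at` and the `DIGRAPHS` table are hypothetical names, not part of `uniseg`, and it assumes (as the callback in the question does) that `breakables[i]` says whether a break is allowed just before `s[i]`. The table holds only the two digraphs mentioned in this thread and must be chosen per language, since digraph sets conflict across languages.

```python
# Hypothetical per-language table; only the digraphs named in this thread.
DIGRAPHS = {"sq": ("dh",), "es": ("ch",)}


def make_digraph_tailor(digraphs):
    """Return a callback that forbids a break inside any given digraph."""
    digraphs = tuple(digraphs)

    def tailor(s, breakables):
        for i, breakable in enumerate(breakables):
            # Boundary i sits just before s[i]; suppress it when some
            # digraph starts k characters earlier and straddles it.
            inside = any(
                s.startswith(d, i - k)
                for d in digraphs
                for k in range(1, len(d))
                if i - k >= 0
            )
            yield 0 if inside else breakable

    return tailor


def split_at(s, breakables):
    """Cut s before every position whose breakable flag is truthy."""
    out = []
    for ch, flag in zip(s, breakables):
        if flag or not out:
            out.append(ch)
        else:
            out[-1] += ch
    return out


word = "dhelpëror"
tailor = make_digraph_tailor(DIGRAPHS["sq"])
# Stand-in for real UAX #29 output: pretend every position is breakable.
naive = [1] * len(word)
print(split_at(word, tailor(word, naive)))
# ['dh', 'e', 'l', 'p', 'ë', 'r', 'o', 'r']
```

With `uniseg` installed, the returned `tailor` would be passed as the second argument to `grapheme_clusters`, exactly as `albanian_digraph_dh` is in the question, so the library's own break rules still supply the `breakables`.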

Holli · Accepted Answer · 2019-08-23T12:55:36.563

D:\>perl6 -e "'dhelpëror'.comb(/dh|./).say"
(dh e l p ë r o r)

You can do the same in old Perl.

# needs UTF-8 source and output, e.g. perl -C -Mutf8 as in the question
print join ' ', 'dhelpëror' =~ /dh|./g;
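
For comparison with the Python in the question, the same alternation can be sketched with the standard `re` module. The helper name `comb` and its `digraphs` parameter are made up for illustration. Like the patterns above, `.` here matches a single code point, so this works for `dhelpëror` only when `ë` is the precomposed character.

```python
import re


def comb(text, digraphs=("dh",)):
    """Split text into digraphs (tried first) and single code points."""
    # Longest alternatives first, so a longer unit wins over its prefix.
    alts = sorted(digraphs, key=len, reverse=True)
    # The trailing "." is the fallback: any one code point, newlines included.
    pattern = re.compile("|".join([re.escape(d) for d in alts] + ["."]), re.DOTALL)
    return pattern.findall(text)


print(" ".join(comb("dhelp\u00ebror")))
# dh e l p ë r o r
```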

edited Aug 23 '19 at 12:55

answered Aug 23 '19 at 12:41

I think the "old Perl" solution should be [`\X`](https://perldoc.perl.org/perlrebackslash.html#Misc) instead of `.`. If it doesn't make a difference for the given string that's because all the graphemes are single codepoints. – raiph Aug 23 '19 at 14:32
Also rather useful is the ability to reuse regexes. `my $albanian = /dh|./; say "a.b.c.dh.e.f.g.h.".comb(/$albanian /)` --> `(a. b. c. dh. e. f. g. h.)` In a grammar, `my token albanian { dh | . }` and then use `<albanian>` wherever you need to use that segmentation rule. – user0721090601 Aug 25 '19 at 00:22
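
The difference between `.` and `\X` raised in the first comment shows up as soon as `ë` is written in decomposed form (`e` followed by U+0308). Python's standard `re` has no `\X` (the third-party `regex` module does), so the sketch below keeps combining marks attached by hand using `unicodedata.combining`. This is a rough approximation for illustration, not an implementation of UAX #29, and `segments` is a hypothetical name.

```python
import re
import unicodedata


def segments(text, digraphs=("dh",)):
    """Split into digraphs or single characters, keeping combining marks attached."""
    ordered = sorted(digraphs, key=len, reverse=True)
    out, i, n = [], 0, len(text)
    while i < n:
        # Take a digraph if one starts here, otherwise one code point.
        j = i + next((len(d) for d in ordered if text.startswith(d, i)), 1)
        # Absorb any combining marks that follow (approximation of \X).
        while j < n and unicodedata.combining(text[j]):
            j += 1
        out.append(text[i:j])
        i = j
    return out


decomposed = "dhelpe\u0308ror"  # "ë" as e + COMBINING DIAERESIS
print(len(re.findall(r"dh|.", decomposed)))  # 9: the diaeresis is split off
print(len(segments(decomposed)))             # 8: it stays with its base letter
```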
