Works:

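#!/usr/bin/env python3
from uniseg.graphemecluster import grapheme_clusters

def albanian_digraph_dh(s, breakables):
    """Tailoring callback: suppress the break between 'd' and 'h'."""
    for i, breakable in enumerate(breakables):
        # Position i sits between s[i-1] and s[i]; 0 means "no break here".
        if s.endswith('d', 0, i) and s.startswith('h', i):
            yield 0
        else:
            # Otherwise keep the library's default decision.
            yield breakable

print(list(grapheme_clusters('dhelpëror', albanian_digraph_dh)))
#['dh', 'e', 'l', 'p', 'ë', 'r', 'o', 'r']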

Needs improvement/customisation:

perl -C -Mutf8 -mUnicode::GCString -E'
    say join " ", Unicode::GCString
        ->new("dhelpëror")->as_array
'
#d h e l p ë r o r

perl6 -e'"dhelpëror".comb.say'
#(d h e l p ë r o r)

NB: writing your own segmentation, which is almost guaranteed not to implement UAX #29 correctly, counts as side-stepping the problem.

Elizabeth Mattijsen
daxim
  • Does the python example count as side-stepping? – Shawn Aug 23 '19 at 09:13
  • No, it doesn't. It demonstrates customisation (implemented as a callback), but otherwise runs the correct segmentation algorithm. – daxim Aug 23 '19 at 09:27
  • "[uniseg] ... runs the correct segmentation algorithm" A fundamental issue with these libraries is their long term maintenance. Are you sure your uniseg is getting it right? When I looked around a year ago the most up-to-date fork of uniseg I found was using out of date grapheme segmentation rules. (Fwiw, the most promising looking python grapheme segmentation library I found was the then newish [grapheme](https://libraries.io/pypi/grapheme) which, for example, [implemented the Unicode 11 rule changes](https://github.com/alvinlindstam/grapheme/commit/8d7b6cf096738a7f1e3a6e9e470e14f1b5a6810b)). – raiph Aug 23 '19 at 22:47
  • I should clarify. When I looked last year it seemed that `grapheme` was the most promising *pure Python* library. (Though, to be clear, it's only claiming alpha status. uniseg is a stage better "on paper" in that it claims beta status but then again it hasn't been updated in years afaict.) Based on my research last year the serious solution in the Python space is [the PyICU project](https://pypi.org/project/PyICU/) which wraps libraries from [the ICU project](http://site.icu-project.org/home). – raiph Aug 24 '19 at 00:32
  • Any idea of how that Python library works? It's probably enumerating all possible digraphs in all languages, is that correct? – jjmerelo Aug 26 '19 at 06:57
  • @raiph because it's selecting digraphs for Albanian; it might have others. ch is a digraph in Spanish, for instance. – jjmerelo Aug 27 '19 at 07:23
  • @raiph as daxim says, writing your own segmentation is sidestepping the problem. There's a Unicode specification for grapheme clusters, which is where all such digraphs by language are defined. I think the Python implementation does that. – jjmerelo Aug 27 '19 at 08:32
  • @jjmerelo "how that Python library works?". Here's the "official" [`uniseg` repo's `graphemecluster.py` that contains the `grapheme_clusters` function](https://bitbucket.org/emptypage/uniseg-python/raw/e4077d17d026c36999b89c10081a85b219e1fb7b/uniseg/graphemecluster.py) daxim used. To use tailored (eg locale specific) grapheme clustering one must manually edit `break_table` and/or supply a custom callback as the `tailor` argument to, eg, `grapheme_clusters` as daxim did. It can't enumerate all possible digraphs because they conflict. It's only selecting Albanian cuz daxim manually wrote that. – raiph Aug 27 '19 at 17:51
  • @jjmerelo "Unicode specification ... where all such digraphs by language are defined." There's a critically important reason why **the Unicode consortium** does *not* refer to most language/locale specific data as a "specification". ["How can I get Unicode implementations to recognize the digraph more generally?"](https://www.unicode.org/faq/ligature_digraph.html#Dig6) links to the [CLDR](http://cldr.unicode.org/). The CLDR will have Albanian digraph data but it may change from release to release without notice and is *not* [a CLDR specification](http://cldr.unicode.org/index/cldr-spec). – raiph Aug 27 '19 at 18:13
  • @jjmerelo "I think the Python implementation [uses the CLDR] where all such digraphs by language are defined." If by "Python implementation" you mean the library daxim used, the pure python `uniseg`, then not really. Someone can *manually edit* the module as explained in my first comment to you above but they must maintain it themselves. The `grapheme` library (which I linked in a comment above) *is* maintained but, afaik, also doesn't bundle CLDR data. Finally, the `PyICU` module (also linked above; it wraps C++ ICU libs) explicitly aims to track CLDR. – raiph Aug 27 '19 at 18:33
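
To make the tailoring point from these comments concrete, here is a minimal stdlib-only sketch of why digraphs have to be selected per locale rather than enumerated globally. Everything named here is an assumption for illustration: `DIGRAPHS`, `make_tailor`, `rough_breakables` and `clusters` are hypothetical helpers, the digraph table is hand-written and is not CLDR data, and `rough_breakables` is a crude stand-in for a real UAX #29 implementation (it only keeps combining marks attached to their base). The callback has the same `(s, breakables)` shape as the one daxim passed to `grapheme_clusters`.

```python
import unicodedata

# Hypothetical, hand-written table for illustration only (not CLDR data).
DIGRAPHS = {
    "sq": ("dh", "gj", "ll", "nj", "rr", "sh", "th", "xh", "zh"),  # Albanian
    "es": ("ch", "ll"),  # traditional Spanish
}

def make_tailor(locale):
    """Build a tailoring callback that glues the chosen locale's digraphs."""
    digraphs = DIGRAPHS.get(locale, ())

    def tailor(s, breakables):
        for i, breakable in enumerate(breakables):
            # Suppress the break between the two letters of a digraph.
            if any(s.endswith(d[0], 0, i) and s.startswith(d[1], i)
                   for d in digraphs):
                yield 0
            else:
                yield breakable

    return tailor

def rough_breakables(s):
    """Crude stand-in for UAX #29: break before every code point
    except one with a non-zero canonical combining class."""
    for i, ch in enumerate(s):
        yield 0 if i and unicodedata.combining(ch) else 1

def clusters(s, tailor):
    """Group code points according to the tailored break decisions."""
    out = []
    for ch, breakable in zip(s, tailor(s, rough_breakables(s))):
        if breakable or not out:
            out.append(ch)
        else:
            out[-1] += ch
    return out

print(clusters("chico", make_tailor("es")))  # ['ch', 'i', 'c', 'o']
print(clusters("chico", make_tailor("sq")))  # ['c', 'h', 'i', 'c', 'o']
```

The same input segments differently depending on which locale's table is selected, which is why a single table covering every language cannot work.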

1 Answer

D:\>perl6 -e "'dhelpëror'.comb(/dh|./).say"
(dh e l p ë r o r)

You can do the same in old Perl.

use utf8; use open qw(:std :utf8);
print join ' ', 'dhelpëror' =~ /dh|./g;
Holli
  • I think the "old Perl" solution should be [`\X`](https://perldoc.perl.org/perlrebackslash.html#Misc) instead of `.`. If it doesn't make a difference for the given string that's because all the graphemes are single codepoints. – raiph Aug 23 '19 at 14:32
  • Also rather useful is the ability to reuse regexes. `my $albanian = /dh|./; say "a.b.c.dh.e.f.g.h.".comb(/$albanian /)` --> `(a. b. c. dh. e. f. g. h.)` In a grammar, `my token albanian { dh | . }` and then use `<albanian>` wherever you need to use that segmentation rule. – user0721090601 Aug 25 '19 at 00:22
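
For comparison, a rough Python analogue of the `dh|.` approach, showing the caveat raiph raises about `.` versus `\X`. This is a sketch under stated assumptions: the stdlib `re` module has no `\X`, so `ALBANIAN_MARKS` approximates "base plus marks" with the Combining Diacritical Marks block (U+0300 to U+036F) only, which is far from full UAX #29. The names `ALBANIAN` and `ALBANIAN_MARKS` are hypothetical.

```python
import re
import unicodedata

# Reusable pattern, analogous to /dh|./: try the digraph first,
# then fall back to any single code point.
ALBANIAN = re.compile(r"dh|.", re.DOTALL)

# Same alternation, but a code point also absorbs any following marks
# from U+0300..U+036F. A rough approximation, NOT an implementation of \X.
ALBANIAN_MARKS = re.compile(r"dh|.[\u0300-\u036f]*", re.DOTALL)

nfc = unicodedata.normalize("NFC", "dhelpëror")  # ë as one code point
nfd = unicodedata.normalize("NFD", "dhelpëror")  # ë as e + U+0308

print(len(ALBANIAN.findall(nfc)))        # 8
print(len(ALBANIAN.findall(nfd)))        # 9: the diaeresis is split off
print(len(ALBANIAN_MARKS.findall(nfd)))  # 8: the mark stays with its base
```

With precomposed input the plain `.` version happens to work; once the same word arrives in decomposed form, it splits the diaeresis from its base letter.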