Text::SpellChecker module and Unicode

Question

#!/usr/local/bin/perl
use strict;
use warnings;

use Text::SpellChecker;

my $text = "coördinator";
my $checker = Text::SpellChecker->new( text => $text );

while ( my $word = $checker->next_word ) {
    print "Bad word is $word\n";
}

Output: Bad word is rdinator

Desired: Bad word is coördinator

The module breaks if I have Unicode in $text. Any idea how this can be solved?

I have Aspell 0.50.5 installed, which this module uses. I think this might be the culprit.

Edit: As Text::SpellChecker requires either Text::Aspell or Text::Hunspell, I removed Text::Aspell, installed Hunspell and Text::Hunspell, and then ran:

$ hunspell -d en_US -l < badword.txt
coördinator

This shows the correct result, which means something is wrong either with my code or with Text::SpellChecker.

Taking Miller's suggestion into consideration, I tried the following:

#!/usr/local/bin/perl
use strict;
use warnings;
use Text::SpellChecker;
use utf8;
binmode STDOUT, ":encoding(utf8)";
my $text =  "coördinator";
my $flag = utf8::is_utf8($text);
print "Flag is $flag\n";
print "Text is $text\n";
my $checker = Text::SpellChecker->new(text => $text);
while (my $word = $checker->next_word) {
    print "Bad word is $word\n";
}

OUTPUT:

Flag is 1
Text is coördinator
Bad word is rdinator

Does this mean the module is not able to handle utf8 characters properly?

I'd recommend properly handling utf8 with regard to your source file and your output streams. However, this issue also occurs while doing such things and using Aspell 0.60.6.1. — Miller, Nov 03 '14 at 06:39
ö is a German diacritic character and I think it can be handled by Latin-1, so you don't actually need UTF-8 unless you have 2-byte characters. Also, as mentioned in the previous comment, UTF data needs to be treated correctly on STDIN and STDOUT, and you need support from the vendor. — ppant, Nov 03 '14 at 09:08

AnFi · Answer 1 · 2014-11-03T09:18:09.223

4

It is a Text::SpellChecker bug: the current version assumes ASCII-only words.

http://cpansearch.perl.org/src/BDUGGAN/Text-SpellChecker-0.11/lib/Text/SpellChecker.pm

#
# next_word
# 
# Get the next misspelled word. 
# Returns false if there are no more.
#
sub next_word {
    ...
    while ($self->{text} =~ m/([a-zA-Z]+(?:'[a-zA-Z]+)?)/g) {

IMHO the best fix would be to use a per-language/locale word-splitting regular expression, or to leave word splitting to the underlying library. `aspell list` reports coördinator as a single word.
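To see why only "rdinator" is reported, here is a minimal sketch that applies the same ASCII-only pattern quoted above outside the module. `ascii_words` is a hypothetical helper written for this illustration, not part of Text::SpellChecker, and the ö is written as an escape so the result does not depend on the file's encoding.

```perl
use strict;
use warnings;
binmode STDOUT, ':encoding(UTF-8)';

# Hypothetical helper (not part of Text::SpellChecker): collects every
# match of the ASCII-only word pattern used in next_word.
sub ascii_words {
    my ($text) = @_;
    my @words;
    while ( $text =~ m/([a-zA-Z]+(?:'[a-zA-Z]+)?)/g ) {
        push @words, $1;
    }
    return @words;
}

# "coördinator" with a precomposed ö (U+00F6)
my @words = ascii_words("co\x{F6}rdinator");
print join( ", ", @words ), "\n";    # prints: co, rdinator
```

The ö falls outside `[a-zA-Z]`, so the word is split into "co" and "rdinator". Presumably the dictionary accepts "co" and rejects "rdinator", which would match the output in the question.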

edited Nov 03 '14 at 09:18

answered Nov 03 '14 at 09:04


1

Thanks, that seems to be the problem. I tried changing it to `while ($self->{text} =~ m/(\p{L}+(?:'\p{L}+)?)/g) {` and it seems to be working fine now. Is this correct? – Chankey Pathak Nov 03 '14 at 10:59
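A quick way to check the `\p{L}` pattern from this comment in isolation is sketched below. `unicode_words` is a hypothetical helper for illustration only, and the sketch assumes the text is already a decoded character string (as it is with `use utf8` in the question), not raw UTF-8 bytes.

```perl
use strict;
use warnings;
binmode STDOUT, ':encoding(UTF-8)';

# Hypothetical helper (not part of Text::SpellChecker): collects every
# match of the Unicode-aware word pattern proposed in the comment.
sub unicode_words {
    my ($text) = @_;
    my @words;
    while ( $text =~ m/(\p{L}+(?:'\p{L}+)?)/g ) {
        push @words, $1;
    }
    return @words;
}

# "coördinator" with a precomposed ö (U+00F6), written as an escape so the
# example does not depend on the encoding of this file.
my @words = unicode_words("co\x{F6}rdinator isn't");
print join( ", ", @words ), "\n";    # prints: coördinator, isn't
```

One caveat: `\p{L}` matches letters only, so a decomposed ö (an "o" followed by the combining diaeresis U+0308) would still be split. Normalizing the text to NFC first, for example with Unicode::Normalize, is one possible way around that.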

score 3 · Answer 2 · answered Nov 04 '14 at 03:02

I've incorporated Chankey's solution and released version 0.12 to CPAN. Give it a try.

The validity of the diaeresis in words like coördinator is interesting. The default aspell and hunspell dictionaries seem to mark it as incorrect, though some publications may disagree.

best, Brian
