#!/usr/local/bin/perl
use strict;
use warnings;

use Text::SpellChecker;

my $text = "coördinator";
my $checker = Text::SpellChecker->new( text => $text );

while ( my $word = $checker->next_word ) {
    print "Bad word is $word\n";
}

Output: Bad word is rdinator

Desired: Bad word is coördinator

The module breaks when $text contains Unicode. Any idea how this can be solved?

I have Aspell 0.50.5 installed, which this module uses. I think this might be the culprit.

Edit: As Text::SpellChecker requires either Text::Aspell or Text::Hunspell, I removed Text::Aspell, installed Hunspell and Text::Hunspell, and then ran:

$ hunspell -d en_US -l < badword.txt
coördinator

This shows the correct result, which means there's something wrong either with my code or with Text::SpellChecker.


Taking Miller's suggestion into consideration, I tried the following:

#!/usr/local/bin/perl
use strict;
use warnings;
use Text::SpellChecker;
use utf8;
binmode STDOUT, ":encoding(utf8)";
my $text =  "coördinator";
my $flag = utf8::is_utf8($text);
print "Flag is $flag\n";
print "Text is $text\n";
my $checker = Text::SpellChecker->new(text => $text);
while (my $word = $checker->next_word) {
    print "Bad word is $word\n";
}

OUTPUT:

Flag is 1
Text is coördinator
Bad word is rdinator

Does this mean the module is not able to handle utf8 characters properly?

Chankey Pathak
  • I'd recommend properly handling utf8 with regard to your source file and your output streams. However, this issue still occurs even when doing so, and with Aspell 0.60.6.1. – Miller Nov 03 '14 at 06:39
  • Hi Miller, see the updated question. – Chankey Pathak Nov 03 '14 at 07:15
  • Could you add printing `$ENV{LANG}` to your test? – AnFi Nov 03 '14 at 08:54
  • ö is a German diacritic and I think it can be handled by Latin-1, so you don't actually need UTF-8 ... if you have 2-byte chars ... Also, as mentioned in previous comments, UTF data needs to be treated correctly on STDIN and STDOUT, and you also need support from the vendor. – ppant Nov 03 '14 at 09:08
  • @AndrzejA.Filip: `en_US.UTF-8` – Chankey Pathak Nov 03 '14 at 09:10

2 Answers

It is a Text::SpellChecker bug: the current version (0.11) assumes ASCII-only words.

http://cpansearch.perl.org/src/BDUGGAN/Text-SpellChecker-0.11/lib/Text/SpellChecker.pm

#
# next_word
# 
# Get the next misspelled word. 
# Returns false if there are no more.
#
sub next_word {
    ...
    while ($self->{text} =~ m/([a-zA-Z]+(?:'[a-zA-Z]+)?)/g) {

IMHO the best fix would be to use a per-language/locale word-splitting regular expression, or to leave word splitting to the underlying library. `aspell list` reports coördinator as a single word.
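
To see why only `rdinator` is reported, the pattern above can be run on its own. The sketch below uses a hypothetical helper, `ascii_words` (not part of Text::SpellChecker), that applies the same ASCII-only regular expression to a string. The `ö` is not in `[a-zA-Z]`, so the word is cut in two. Presumably the dictionary accepts `co`, which would leave `rdinator` as the only piece reported.

```perl
use strict;
use warnings;
use utf8;    # string literals in this file are UTF-8 and become character strings

# Hypothetical helper (not part of Text::SpellChecker): collect candidate
# words using the same ASCII-only pattern that next_word uses in 0.11.
sub ascii_words {
    my ($text) = @_;
    my @words;
    while ( $text =~ m/([a-zA-Z]+(?:'[a-zA-Z]+)?)/g ) {
        push @words, $1;
    }
    return @words;
}

binmode STDOUT, ':encoding(UTF-8)';
print join( ', ', ascii_words("coördinator") ), "\n";    # prints: co, rdinator
```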

AnFi
  • Thanks, that seems to be the problem. I tried changing it to `while ($self->{text} =~ m/(\p{L}+(?:'\p{L}+)?)/g) {` and it seems to be working fine now. Is this correct? – Chankey Pathak Nov 03 '14 at 10:59
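
A minimal sketch of the change suggested in the comment above, run outside the module. `unicode_words` is a hypothetical helper, not part of Text::SpellChecker, and the sketch assumes the text is a decoded character string (here via `use utf8` for the literal; input read from a file or STDIN would need to be decoded first).

```perl
use strict;
use warnings;
use utf8;    # string literals in this file are UTF-8 and become character strings

# Hypothetical helper: collect words using \p{L} (any Unicode letter)
# in place of the ASCII-only [a-zA-Z] class.
sub unicode_words {
    my ($text) = @_;
    my @words;
    while ( $text =~ m/(\p{L}+(?:'\p{L}+)?)/g ) {
        push @words, $1;
    }
    return @words;
}

binmode STDOUT, ':encoding(UTF-8)';
print join( ', ', unicode_words("coördinator isn't split") ), "\n";
# prints: coördinator, isn't, split
```

Note that `\p{L}` also admits letters from other scripts, which may or may not be what a given dictionary expects.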

I've incorporated Chankey's solution and released version 0.12 to the CPAN. Give it a try.

The validity of diaeresis in words like coördinator is interesting. The default aspell and hunspell dictionaries seem to mark it as incorrect, though some publications may disagree.

best, Brian

Brian