How to parse kanji numeric characters using ICU?

Question

I'm writing a function that uses ICU to parse a Unicode string consisting of kanji numeric characters, and I want it to return the integer value of the string.

"五" => 5
"三十一" => 31
"五千九百七十二" => 5972

I'm setting the locale to Locale::getJapan() and using NumberFormat::parse() to parse the character string. However, whenever I pass it any kanji characters, parse() returns U_INVALID_FORMAT_ERROR.

Does anyone know whether ICU supports kanji character strings in the NumberFormat::parse() method? I was hoping that, since I'm setting the locale to Japanese, it would be able to parse kanji numeric values.

Thanks!

#include <iostream>
#include <unicode/numfmt.h>
#include <unicode/unistr.h>

using namespace std;
using namespace icu;

int main(int argc, char **argv) {
    const Locale &jaLocale = Locale::getJapan();
    UErrorCode status = U_ZERO_ERROR;
    NumberFormat *nf = NumberFormat::createInstance(jaLocale, status);
    if (U_FAILURE(status)) {
        cout << "error creating formatter: " << u_errorName(status) << endl;
        return 1;
    }

    // Character for '5' in Japanese '五'; the array must be NUL-terminated
    UChar number[] = {0x4E94, 0};
    UnicodeString numStr(number);
    Formattable formattable;
    nf->parse(numStr, formattable, status);
    if (U_FAILURE(status)) {
        cout << "error parsing as number: " << u_errorName(status) << endl;
        delete nf;
        return 1;
    }
    cout << "long value: " << formattable.getLong() << endl;
    delete nf;
    return 0;
}

I don't know, but it's a cool question, I'm looking forward to an answer. — Charlie Martin, Apr 28 '09 at 01:54
are you asking about the algorithm of how to solve the problem? or are you asking about getting the character codes in order to interpret them (i.e. encoding problem)? — hasen, May 04 '09 at 09:52
Thanks for all the answers and comments! To clarify what I'm looking for is whether ICU is able to correctly parse strings with kanji numeric values and return back the number as an integer. I'm restricted to using ICU and if ICU is able to do this, then I wouldn't have to write my own routine to handle this. I'm developing a program to support this for different locales, and would prefer not to write customized routines for each locale. Ideally, I just want to pass the locale and the data string to ICU, and have it return the integer value. — , May 04 '09 at 19:11
See answer of "Steven R. Loomis" it is the correct one. RBNF supports it. — Artyom, Oct 12 '09 at 08:30

Steven R. Loomis · Answer 1 · 2011-10-03T16:57:01.540

6

You can use ICU's Rule Based Number Format (RBNF) module: rbnf.h in C++, or unum.h with the UNUM_SPELLOUT option in C, in both cases with the "ja" locale for Japanese. Artyom provides a correction to your code for C++: new RuleBasedNumberFormat(URBNF_SPELLOUT, jaLocale, status);

edited Oct 03 '11 at 16:57

answered Oct 07 '09 at 17:29

Steven R. Loomis


1

This is the correct answer: instead of `NumberFormat::createInstance(jaLocale, status);` use `new RuleBasedNumberFormat(URBNF_SPELLOUT, jaLocale, status);` – Artyom Oct 12 '09 at 08:28

si28719e · Answer 2 · 2009-05-04T09:54:15.480

I created a small Perl module to do this a while back. It can convert Arabic <=> Japanese, and though I haven't tested it exhaustively, I think it's pretty comprehensive. Feel free to improve it.

 
package kanjiArabic;
use strict;
use warnings;
our $VERSION = "1.00";
use utf8;

our %big = (
    十 => 10,百 => 100,千 => 1000,
    );
our %bigger = (
    万 => 10000,億 => 100000000,
    兆 => 1000000000000,京 => 10000000000000000,
    垓 => 100000000000000000000,
    );
#precompile regexes                                                                                                          
our $qr = qr/[0-9]/;
our $bigqr = qr/[十百千]/;
our $biggerqr = qr/[万億兆京垓]/;

#this routine does most of the real work.
sub kanji2arabic{
    local $_ = shift;   # localize so the caller's $_ is not clobbered

    tr/〇一二三四五六七八九/0123456789/;
    #optionally precompile for performance boost
    s/(?<=${qr})(${bigqr})/\*${1}/g;
    s/(?<=${bigqr})(${bigqr})/\+${1}/g;
    s/(${bigqr})(?=${qr})/${1}\+/g;
    s/(${bigqr})(?=${bigqr})/${1}\+/g;
    s/(${bigqr})/${big{$1}}/g;

    s/([0-9\+\*]+)/\(${1}\)/g;

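    # The rest of this routine is a reconstruction that follows the pattern
    # above (the original lines are missing here); treat it as an assumption.
    s/(?<=\))(${biggerqr})/\*${1}/g;
    s/(${biggerqr})(?=\()/${1}\+/g;
    s/(${biggerqr})/${bigger{$1}}/g;

    return $_;
}

our %asmall = (
    0 => "〇", 1 => "一", 2 => "二", 3 => "三", 4 => "四",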
    5 => "五", 6 => "六", 7 => "七", 8 => "八", 9 => "九",
    );
our %places = (
    1 => 10, 
    2 => 100, 
    3 => 1000, 
    4 => 10000, 
    8 => 100000000, 
    12 => 1000000000000,
    16 => 10000000000000000, 
    20 => 100000000000000000000,
    );
our %abig   = (
    10 => "十", 
    100 => "百", 
    1000 => "千", 
    10000 => "万", 
    100000000 => "億",
    1000000000000 => "兆", 
    10000000000000000 => "京", 
    100000000000000000000 => "垓",
    );
our $MAX = 24; #We only support numbers up to 24 digits!                                                                     


sub arabic2kanji{
    my @number = reverse(split(//,$_[0]));
    my @kanji;
    for(my $i=$#number;$i>=0;$i--){
        if( $i==0 ){push(@kanji,$asmall{$number[$i]});}
        elsif( $i % 4 == 0 ){
            if( $number[$i] !~ m/[01]/ ){
                push(@kanji,$asmall{$number[$i]});
            }
            push(@kanji,$abig{$places{$i}});
        }else{
            my $p = $i % 4;
            if( $number[$i]==0 ){
                next;
            }elsif( $number[$i]==1 ){
                push(@kanji,$abig{$places{$p}});
            }else{
                push(@kanji,$asmall{$number[$i]});
                push(@kanji,$abig{$places{$p}});
            }
        }
    }
    return join("",@kanji);
}


sub eval_k2a{
    #feed me utf-8!                                                                                                          
    if($_[0] !~ m/^[〇一二三四五六七八九十百千万億兆京垓]+$/){
        print "Error: ".$_[0].
              " not a Kanji number.\n" if defined($_[1])&&$_[1]==1;
        return -1;
    }
    my $expression = kanji2arabic($_[0]);
    print $expression."\n" if defined($_[1])&&$_[1]==1;
    return eval($expression);
}



1;

You'd then call it from another script like so:


#!/usr/bin/perl -w
use strict;
use warnings;
use Encode;
use kanjiArabic;

my $kanji = kanjiArabic::arabic2kanji($ARGV[0]);
print "Kanji: ".encode("utf8",$kanji)."\n";
my $arabic =  kanjiArabic::eval_k2a($kanji);
print "Back to arabic...\n";
print "Arabic: ".$arabic."\n";

And use this script like so:


kettle:~/k2a$ ./k2a.pl 5000215
Kanji: 五百万二百十五
Back to arabic...
Arabic: 5000215

rock on.

score 1 · Answer 3 · answered Apr 28 '09 at 05:21

1

I was inspired by your question to solve this problem using Python.

If you don't find a C++ solution, it shouldn't be too hard to adapt this to C++.
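For anyone who does end up hand-rolling it in C++, a minimal sketch of such a converter might look like the following. It is not the Python code mentioned above. The function names kanjiDigit and kanjiToInteger are made up for illustration, the input is assumed to be a UTF-32 string, the source file is assumed to be saved as UTF-8, and only 〇 to 九, 十, 百, 千, 万, 億 and 兆 are handled.

```cpp
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <string>

// Value of a kanji digit (〇, 一 ... 九), or -1 if c is not one of them.
static int kanjiDigit(char32_t c) {
    static const std::u32string digits = U"〇一二三四五六七八九";
    const std::size_t pos = digits.find(c);
    return pos == std::u32string::npos ? -1 : static_cast<int>(pos);
}

// Converts a kanji numeral such as U"五千九百七十二" to an integer.
// Throws std::invalid_argument on empty input or an unsupported character.
std::uint64_t kanjiToInteger(const std::u32string &text) {
    if (text.empty()) {
        throw std::invalid_argument("empty kanji numeral");
    }
    std::uint64_t total = 0;  // sum of completed 万/億/兆 groups
    std::uint64_t group = 0;  // value below 10000 currently being built
    int digit = -1;           // pending digit, -1 when none has been seen

    for (char32_t c : text) {
        const int d = kanjiDigit(c);
        if (d >= 0) {
            digit = d;
            continue;
        }

        std::uint64_t unit = 0;
        bool closesGroup = false;
        switch (c) {
            case U'十': unit = 10; break;
            case U'百': unit = 100; break;
            case U'千': unit = 1000; break;
            case U'万': unit = 10000ULL; closesGroup = true; break;
            case U'億': unit = 100000000ULL; closesGroup = true; break;
            case U'兆': unit = 1000000000000ULL; closesGroup = true; break;
            default: throw std::invalid_argument("not a kanji numeral");
        }

        if (closesGroup) {
            // 万/億/兆 multiply the whole group collected so far.
            if (digit >= 0) group += static_cast<std::uint64_t>(digit);
            total += (group == 0 ? 1 : group) * unit;
            group = 0;
        } else {
            // 十/百/千 multiply only the preceding digit (implicitly 1).
            group += static_cast<std::uint64_t>(digit < 0 ? 1 : digit) * unit;
        }
        digit = -1;
    }
    return total + group + static_cast<std::uint64_t>(digit >= 0 ? digit : 0);
}
```

For example, kanjiToInteger(U"三十一") walks 三 (pending digit 3), 十 (group becomes 30) and 一 (pending digit 1), and returns 31. The sketch does not reject malformed sequences such as 十十, so real input would need stricter validation.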

answered Apr 28 '09 at 05:21

Ryan Ginstrom


score 0 · Answer 4 · answered Mar 23 '10 at 05:37

0

This is actually quite difficult, especially if you start looking at the obscure kanji for very large numbers.

In Perl, there is a very complete implementation in Lingua::JA::Numbers. Its source might be inspirational if you want to port it to C++.

answered Mar 23 '10 at 05:37

Gavin Brock

