I need to write a Python program to convert Spanish numbers written out as text into digits:

Input:

'Ciento Veinticuatro Mil Ochocientos Treinta y Cinco'

Desired output:

124835

I've written some code, but I've realized that I'm reinventing the wheel: it's just a parser. So I need to use a lexer/grammar parser module. But I've never worked with lexer/grammar parsers before, and first I need to write the grammar in BNF or PEG notation (I haven't decided yet which parser module I'll use; the simplest one I can find).

It's hard for me; the Spanish grammar for numerals is quite different from the English one.

My approach:

<numeral> ::= ([<centenas>][<decenas>][<unidades>])+ [<millares>]
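
One way to read this kind of grammar, as a minimal hand-written sketch rather than a real parser module: a group is `[centenas] [decenas ["y"]] [unidades]`, and a numeral is `[[group] "mil"] [group]`. The table names, the function names `parse_group` and `parse_numeral`, and the handful of words listed are all assumptions made for illustration; fused forms such as "veinticuatro" are simply treated as units-position words.

```python
# Hypothetical token tables; only a few words are listed for illustration.
CENTENAS = {"ciento": 100, "ochocientos": 800}
DECENAS = {"veinte": 20, "treinta": 30}
UNIDADES = {"cinco": 5, "veinticuatro": 24}


def parse_group(tokens, pos):
    """Parse [centenas] [decenas ["y"]] [unidades]; return (value, new_pos)."""
    value = 0
    for table in (CENTENAS, DECENAS, UNIDADES):
        if pos < len(tokens) and tokens[pos] in table:
            value += table[tokens[pos]]
            pos += 1
            # The conjunction "y" may follow the tens word; it has no value.
            if table is DECENAS and pos < len(tokens) and tokens[pos] == "y":
                pos += 1
    return value, pos


def parse_numeral(text):
    """numeral ::= [[group] "mil"] [group]"""
    tokens = text.lower().split()
    value, pos = parse_group(tokens, 0)
    if pos < len(tokens) and tokens[pos] == "mil":
        # A bare "mil" means 1000, so an empty leading group counts as 1.
        thousands = value or 1
        rest, pos = parse_group(tokens, pos + 1)
        value = thousands * 1000 + rest
    if pos != len(tokens):
        raise ValueError("unexpected word: " + tokens[pos])
    return value
```

With these tables, `parse_numeral('Ciento Veinticuatro Mil Ochocientos Treinta y Cinco')` walks one group, the word "mil", and a second group. The sketch is deliberately lenient (it would accept a trailing "y", for instance) and stops at thousands.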

I fear this is a question for Spanish speakers.

Trimax
  • I'm not sure why you think Spanish numbers are so different from English; it's basically the same structure. Of course, there are the fused hundreds (with gender: quinientas), but that's a minor detail and your sample code seems to be on the right track. Try Irish if you want something complicated :) – rici Sep 02 '14 at 03:42
  • @rici It's not only the fused hundreds that have gender: 21 -> "veintiuno", "veintiuna". Some numbers have several forms (more if you count the accented character): 21 -> "veintiún", "veintiun", "veintiuno", "veintiuna". There is also the conjunction "y" between tens and units: 35 -> "treinta" "y" "cinco". Even so, Spanish isn't the most tangled language in my country (Spain): in the north, Basque (also called "Euskera") is spoken, and its numeral system is crazy: http://www.santurtzieus.com/gelairekia/laguntza/funtzioak/los_numeros.htm – Trimax Sep 02 '14 at 06:54
  • Well, yes, but recognizing the variants is no problem; you just have to put them all in your lexicon. And ignore the `y`. If I were you I'd ignore the accents too; lots of people don't type them, especially if they don't have a suitable keyboard. And I insist that Irish is even worse than Basque. E.g. fifteen is "a cúig déag" (a cúig=5; a deich=10) and 17 is "a seacht déag". But 15 pounds: "cúig phunt déag". 17 pounds: "seacht bpunt déag". Pound is "punt", but the numbers change the following word: five phunt (funt), seven bpunt (bunt; the p is silent here). And they interleave: seven bunt ten. – rici Sep 02 '14 at 08:12
  • Also, English is not so simple. For example, a native speaker (yo) would read the range 4050-4100 as "between four thousand fifty and forty-one hundred." Note that "forty hundred" is simply incorrect, while "four thousand one hundred" is possible but uncommon except in cases of emphasis: "there are forty-one hundred -- I repeat, four *thousand* one hundred -- of these..." I don't believe Spanish has this subtlety. – rici Sep 02 '14 at 09:01
  • A billion in Spanish is more than in English : ) – 1010 Sep 23 '14 at 02:06
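
The suggestion in the comments above (list every variant in the lexicon and ignore accents) could look roughly like this. It is only a sketch: the names `normalize`, `VARIANTS` and `lookup` are made up for illustration, and the lexicon holds just a few of the forms mentioned in the comments.

```python
import unicodedata


def normalize(word):
    """Lowercase a word and strip combining accents (veintiún -> veintiun)."""
    decomposed = unicodedata.normalize("NFD", word.lower())
    # Drop combining marks (category "Mn"). Note this would also turn
    # "ñ" into "n", which is assumed to be harmless for number words.
    return "".join(c for c in decomposed if unicodedata.category(c) != "Mn")


# Hypothetical lexicon: every gender/apocope variant maps to the same value.
VARIANTS = {
    "veintiuno": 21, "veintiuna": 21, "veintiun": 21,
    "quinientos": 500, "quinientas": 500,
}


def lookup(word):
    """Return the value of a number word, or None if it is not in the lexicon."""
    return VARIANTS.get(normalize(word))
```

Because accents are stripped before the lookup, "veintiún" and "veintiun" share a single entry, so only the unaccented spellings need to be listed.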

1 Answer

You can achieve this by making some modifications to the text2num library: https://github.com/ghewgill/text2num

import re

Small = {
    'cinco': 5,
    'veinticuatro': 24,
    'treinta': 30,
    'ciento': 100,
    'ochocientos': 800
}


Magnitude = {
    'mil':          1000
}

class NumberException(Exception):
    def __init__(self, msg):
        Exception.__init__(self, msg)

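def text2num(s):
    # Split on whitespace and hyphens; matching is case-insensitive.
    a = re.split(r"[\s-]+", s.lower())
    n = 0  # total of the groups already closed by a magnitude word
    g = 0  # value of the group currently being accumulated
    for w in a:
        if w == 'y':
            # The conjunction between tens and units carries no value.
            continue
        x = Small.get(w)
        if x is not None:
            g += x
        else:
            x = Magnitude.get(w)
            if x is not None:
                # A bare "mil" means 1000, so an empty group counts as 1.
                n += (g or 1) * x
                g = 0
            else:
                raise NumberException("Unknown number: " + w)
    return n + g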

if __name__ == "__main__":
    assert 124835 == text2num('Ciento Veinticuatro Mil Ochocientos Treinta y Cinco')
Juan Diego Godoy Robles