
I am trying to convert numerical values written as words into integers. For example,

iPhone has two hundred and thirty thousand seven hundred and eighty three apps

would become

iPhone has 230783 apps

Is there any library or function that does this?

pigrammer
user132513

6 Answers


There are lots of pages discussing the conversion from numbers to words, but not so many for the reverse direction. The best I could find was some pseudo-code on Yahoo Answers. See http://answers.yahoo.com/question/index?qid=20090216103754AAONnDz for a nice algorithm:

Well, overall you are doing two things: finding tokens (words that translate to numbers) and applying grammar. In short, you are building a parser for a very limited language.

The tokens you would need are:

POWER: thousand, million, billion
HUNDRED: hundred
TEN: twenty, thirty... ninety
UNIT: one, two, three, ... nine
SPECIAL: ten, eleven, twelve, ... nineteen

(Drop any "and"s, as they carry no meaning. Split hyphenated words into two tokens; that is, "sixty-five" should be processed as "sixty" "five".)

Once you've tokenized your string, move from RIGHT TO LEFT.

  1. Grab all the tokens from the RIGHT until you hit a POWER or the whole string.

  2. Parse the tokens after the stop point for these patterns:

    SPECIAL
    TEN
    UNIT
    TEN UNIT
    UNIT HUNDRED
    UNIT HUNDRED SPECIAL
    UNIT HUNDRED TEN
    UNIT HUNDRED UNIT
    UNIT HUNDRED TEN UNIT

    (This assumes that "seventeen hundred" is not allowed in this grammar)

    This gives you the last three digits of your number.

  3. If you stopped at the whole string you are done.

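  4. If you stopped at a POWER, start again at step 1 with the tokens to its left, until you reach a higher POWER or the start of the string. Multiply each group by the POWER on its right and add it to the total.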
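
The steps above can be sketched in PHP roughly as follows. This is only a minimal sketch of the quoted grammar, not the original author's code: the function names `wordsToIntByGrammar` and `parseGroup` are made up for illustration, and anything outside the grammar (including "seventeen hundred") yields `null`.

```php
<?php
// Hypothetical helper: parse one group of tokens (the part between two POWER
// tokens) against the patterns listed in step 2. Returns 1..999 or null.
function parseGroup(array $tokens): ?int
{
    $units = ['one' => 1, 'two' => 2, 'three' => 3, 'four' => 4, 'five' => 5,
              'six' => 6, 'seven' => 7, 'eight' => 8, 'nine' => 9];
    $specials = ['ten' => 10, 'eleven' => 11, 'twelve' => 12, 'thirteen' => 13,
                 'fourteen' => 14, 'fifteen' => 15, 'sixteen' => 16,
                 'seventeen' => 17, 'eighteen' => 18, 'nineteen' => 19];
    $tens = ['twenty' => 20, 'thirty' => 30, 'forty' => 40, 'fifty' => 50,
             'sixty' => 60, 'seventy' => 70, 'eighty' => 80, 'ninety' => 90];

    $value = 0;
    // Optional leading "UNIT HUNDRED".
    if (count($tokens) >= 2 && isset($units[$tokens[0]]) && $tokens[1] === 'hundred') {
        $value = $units[$tokens[0]] * 100;
        $tokens = array_slice($tokens, 2);
        if ($tokens === []) {
            return $value;
        }
    }
    // A single UNIT, SPECIAL or TEN.
    if (count($tokens) === 1) {
        $word = $tokens[0];
        $small = $units[$word] ?? $specials[$word] ?? $tens[$word] ?? null;
        return $small === null ? null : $value + $small;
    }
    // "TEN UNIT", e.g. "eighty three".
    if (count($tokens) === 2 && isset($tens[$tokens[0]]) && isset($units[$tokens[1]])) {
        return $value + $tens[$tokens[0]] + $units[$tokens[1]];
    }
    return null; // Not one of the allowed patterns.
}

// Hypothetical entry point: tokenize, then walk the tokens from right to left.
function wordsToIntByGrammar(string $text): ?int
{
    $powers = ['thousand' => 1000, 'million' => 1000000, 'billion' => 1000000000];

    // Split on whitespace, commas and hyphens, then drop every "and".
    $tokens = preg_split('/[\s,-]+/', strtolower(trim($text)), -1, PREG_SPLIT_NO_EMPTY);
    $tokens = array_values(array_diff($tokens, ['and']));

    $total = 0;
    $multiplier = 1;  // Value of the POWER to the right of the current group
    $group = [];
    $lastIndex = count($tokens) - 1;

    for ($i = $lastIndex; $i >= 0; $i--) {
        $token = $tokens[$i];
        if (!isset($powers[$token])) {
            array_unshift($group, $token); // Still collecting the current group
            continue;
        }
        if ($group !== []) {
            $value = parseGroup($group);
            if ($value === null) {
                return null;
            }
            $total += $value * $multiplier;
        } elseif ($i !== $lastIndex) {
            return null; // Two POWER tokens in a row
        }
        if ($powers[$token] <= $multiplier) {
            return null; // POWERs must grow towards the left
        }
        $multiplier = $powers[$token];
        $group = [];
    }

    // Leftmost group; a POWER with nothing in front of it is rejected.
    $value = parseGroup($group);
    return $value === null ? null : $total + $value * $multiplier;
}
```

Rejecting unknown input with `null` rather than guessing is a design choice of this sketch; a more forgiving parser could skip unknown words instead.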

John Kugelman
  • Thank you John! This algo is exactly what I was looking for. I was trying to parse it from left to right, but this looks better. Appreciate your help! – user132513 Jul 03 '09 at 05:31
  • I've added an answer below that implements a vaguely similar algorithm. – El Yobo Jun 27 '12 at 05:13

Old question, but for anyone else coming across this: I had to write a solution for this today. The following takes a vaguely similar approach to the algorithm described by John Kugelman, but doesn't apply as strict a grammar; as such it will permit some weird orderings, e.g. "one hundred thousand and one million" will still produce the same as "one million and one hundred thousand" (1,100,000). Invalid bits (e.g. misspelled numbers) will be ignored, so consider the output on invalid strings to be undefined.

Following user132513's comment on joebert's answer, I used PEAR's Numbers_Words to generate test series. The following code scored 100% on numbers between 0 and 5,000,000, then 100% on a random sample of 100,000 numbers between 0 and 10,000,000 (it takes too long to run over the whole 10 million series).

/**
 * Convert a string such as "one hundred thousand" to 100000.00.
 *
 * @param string $data The numeric string (lower case, containing a single number).
 *
 * @return float The parsed value; unrecognised words count as 0.
 */
function wordsToNumber($data) {
    // Replace all number words with an equivalent numeric value
    $data = strtr(
        $data,
        array(
            'zero'      => '0',
            'a'         => '1',
            'one'       => '1',
            'two'       => '2',
            'three'     => '3',
            'four'      => '4',
            'five'      => '5',
            'six'       => '6',
            'seven'     => '7',
            'eight'     => '8',
            'nine'      => '9',
            'ten'       => '10',
            'eleven'    => '11',
            'twelve'    => '12',
            'thirteen'  => '13',
            'fourteen'  => '14',
            'fifteen'   => '15',
            'sixteen'   => '16',
            'seventeen' => '17',
            'eighteen'  => '18',
            'nineteen'  => '19',
            'twenty'    => '20',
            'thirty'    => '30',
            'forty'     => '40',
            'fourty'    => '40', // common misspelling
            'fifty'     => '50',
            'sixty'     => '60',
            'seventy'   => '70',
            'eighty'    => '80',
            'ninety'    => '90',
            'hundred'   => '100',
            'thousand'  => '1000',
            'million'   => '1000000',
            'billion'   => '1000000000',
            'and'       => '',
        )
    );

    // Coerce all tokens to numbers
    $parts = array_map(
        function ($val) {
            return floatval($val);
        },
        preg_split('/[\s-]+/', $data)
    );

    $stack = new SplStack; // Current work stack
    $sum   = 0; // Running total
    $last  = null;

    foreach ($parts as $part) {
        if (!$stack->isEmpty()) {
            // We're part way through a phrase
            if ($stack->top() > $part) {
                // Decreasing step, e.g. from hundreds to ones
                if ($last >= 1000) {
                    // If we drop from more than 1000 then we've finished the phrase
                    $sum += $stack->pop();
                    // This is the first element of a new phrase
                    $stack->push($part);
                } else {
                    // Drop down from less than 1000, just addition
                    // e.g. "seventy one" -> "70 1" -> "70 + 1"
                    $stack->push($stack->pop() + $part);
                }
            } else {
                // Increasing step, e.g ones to hundreds
                $stack->push($stack->pop() * $part);
            }
        } else {
            // This is the first element of a new phrase
            $stack->push($part);
        }

        // Store the last processed part
        $last = $part;
    }

    return $sum + $stack->pop();
}
El Yobo
  • I found this to be a very robust and succinct solution. Great job! My only edit was to add `'lakh' => '100000'` and `'crore' => '10000000'` as mentioned by user132513 in joebert's answer. – Khior Feb 13 '13 at 12:20
  • One case where this doesn't work is, for example, `$data = 'five or ten'`, which returns 50. The answer above works well for the OP, but it assumes the string has "correct" formatting. In my case, I was trying to extract the number from an unchecked string, without controlling (or knowing) what the string could be. Users sometimes put some pretty strange responses into forms! – Sablefoste Dec 18 '14 at 15:52
  • "Invalid bits (e.g. misspelled numbers) will be ignored, so consider the output on invalid strings to be undefined"; unfortunately this is only intended to convert a string containing a single number. You could try splitting your string into fragments using the `$data` list above (as those are the only substrings that we care about), then run it on each fragment and combine the results using the split words. – El Yobo Dec 18 '14 at 21:14
  • @ElYobo It works well, except that if the `$data` value is `Ten` instead of `ten`, it returns 0 instead of 10. Could you help with this case sensitivity? – Raja Gopal Sep 25 '16 at 05:32
  • Call `wordsToNumber` with a lower case string, e.g. `wordsToNumber(strtolower($my_string))`. – El Yobo Sep 27 '16 at 00:42
  • I would recommend making this the first line: `$data = strtolower(trim($data));`. This addresses the point made by @RajaGopal – sean.boyer Mar 15 '17 at 15:00
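
El Yobo's suggestion in the comments above (split the string into fragments and convert each one) could be sketched along these lines. This is an illustration only: `replaceNumberPhrases` is a made-up name, the word list is an assumption copied from the number words used in this thread, and `$convert` stands for whatever single-number converter you use.

```php
<?php
// Hypothetical helper: find each run of number words in $text and replace it
// with the result of $convert, leaving all other text untouched.
function replaceNumberPhrases(string $text, callable $convert): string
{
    $words = ['zero', 'one', 'two', 'three', 'four', 'five', 'six', 'seven',
              'eight', 'nine', 'ten', 'eleven', 'twelve', 'thirteen', 'fourteen',
              'fifteen', 'sixteen', 'seventeen', 'eighteen', 'nineteen', 'twenty',
              'thirty', 'forty', 'fifty', 'sixty', 'seventy', 'eighty', 'ninety',
              'hundred', 'thousand', 'million', 'billion'];
    $word = '(?:' . implode('|', $words) . ')';

    // One number word, then any further number words separated by spaces,
    // commas, hyphens or "and". An "and" is only consumed when another number
    // word follows it.
    $pattern = '/\b' . $word . '(?:(?:[\s,-]+and\s+|[\s,-]+)' . $word . ')*\b/i';

    $result = preg_replace_callback(
        $pattern,
        function (array $match) use ($convert) {
            // Lower-case the phrase so case-sensitive converters still work
            return (string) $convert(strtolower($match[0]));
        },
        $text
    );

    return $result ?? $text; // preg_replace_callback returns null on error
}
```

For example, `replaceNumberPhrases($str, 'wordsToNumber')` would hand each run of number words to the function from this answer one at a time, so in a string like 'five or ten' the words "five" and "ten" are converted separately.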

I haven't tested this extensively; I more or less just worked on it until I saw what I expected in the output. It seems to work, and it parses from left to right.

<?php

$str = 'twelve billion people know iPhone has two hundred and thirty thousand, seven hundred and eighty-three apps as well as over one million units sold';

function strlen_sort($a, $b)
{
    if(strlen($a) > strlen($b))
    {
        return -1;
    }
    else if(strlen($a) < strlen($b))
    {
        return 1;
    }
    return 0;
}

$keys = array(
    'one' => '1', 'two' => '2', 'three' => '3', 'four' => '4', 'five' => '5', 'six' => '6', 'seven' => '7', 'eight' => '8', 'nine' => '9',
    'ten' => '10', 'eleven' => '11', 'twelve' => '12', 'thirteen' => '13', 'fourteen' => '14', 'fifteen' => '15', 'sixteen' => '16', 'seventeen' => '17', 'eighteen' => '18', 'nineteen' => '19',
    'twenty' => '20', 'thirty' => '30', 'forty' => '40', 'fifty' => '50', 'sixty' => '60', 'seventy' => '70', 'eighty' => '80', 'ninety' => '90',
    'hundred' => '100', 'thousand' => '1000', 'million' => '1000000', 'billion' => '1000000000'
);


preg_match_all('#((?:^|and|,| |-)*(\b' . implode('\b|\b', array_keys($keys)) . '\b))+#i', $str, $tokens);
//print_r($tokens); exit;
$tokens = $tokens[0];
usort($tokens, 'strlen_sort');

foreach($tokens as $token)
{
    $token = trim(strtolower($token));
    // Match each number word on its own; separators such as "and", "," and "-" are skipped
    preg_match_all('#\b(?:' . implode('|', array_keys($keys)) . ')\b#', $token, $words);
    $words = $words[0];
    //print_r($words);
    $num = '0'; $total = 0;
    foreach($words as $word)
    {
        $word = trim($word);
        $val = $keys[$word];
        //echo "$val\n";
        if(bccomp($val, 100) == -1)
        {
            $num = bcadd($num, $val);
            continue;
        }
        else if(bccomp($val, 100) == 0)
        {
            $num = bcmul($num, $val);
            continue;
        }
        $num = bcmul($num, $val);
        $total = bcadd($total, $num);
        $num = '0';
    }
    $total = bcadd($total, $num);
    echo "$total:$token\n";
    $str = preg_replace("#\b$token\b#i", number_format($total), $str);
}
echo "\n$str\n";

?>
joebert
  • Found one flaw, it misses common mixtures of numbers and words such as "2 million". – joebert Jul 03 '09 at 06:21
  • It will also mess with certain wordings for dates. "I was born in nineteen eighty one" – joebert Jul 03 '09 at 06:27
  • Thank you very much, joebert, for the code! I'll try to improve on it. I have set up a test set of 10000 random number words (using Numbers_Words), and currently the accuracy of decoding words to numbers is 75%. Correct: "forty five thousand five hundred and fifty four" becomes 45554. Incorrect: "fifty one thousand five hundred and eighty six" becomes 586. – user132513 Jul 09 '09 at 01:17
  • Just realized the issue. There is something funny happening while accessing the first key, i.e. 'one'. Instead, put 'quadrillion' => '1000000000000000' before 'one' and it works with 100% accuracy. – user132513 Jul 10 '09 at 01:28
  • Also, include 'lakh' => '100000' and 'crore' => '10000000' in $keys. They are more common terms than million in South Asian countries. – user132513 Jul 10 '09 at 01:30
  • That makes sense. I have a filesize formatter that works similarly. I must have been in a rush and forgot to put the largest numbers first in the check. – joebert Jul 14 '09 at 15:02

This is a somewhat updated version of El Yobo's answer; the wordsToNumber function can now be run over (almost) any string containing numerals.

https://github.com/thefish/words-to-number-converter

converter.php - converter itself

test.php - test with various strings

UPD 22.10.2020: The answer became too big to maintain, so I moved the code to GitHub.

thefish
  • This works much better than the others for address numbers, "one two three main street" or "one twenty three main street". – Jay A. Little Oct 07 '18 at 11:27
  • @thefish Love the mixed input support, but some strings like "thirty three thousand and fifty nine dollars" don't work. The "and" causes problems. – Enlai Oct 11 '20 at 11:07

The simplest way I've found is to use numfmt_parse:

$fmt = numfmt_create('en_US', NumberFormatter::SPELLOUT);
echo numfmt_parse($fmt, 'one million two hundred thirty-four thousand five hundred sixty-seven');

(Source: Dorian's post at https://stackoverflow.com/a/31588055/11827985.)
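
A slightly more defensive variant of the snippet above might look like this. It is a sketch only: `parseSpelledOutNumber` is a made-up wrapper name, it assumes the intl extension is installed, and which phrasings are accepted depends on the ICU spell-out rules for the chosen locale.

```php
<?php
// Hypothetical wrapper around NumberFormatter's spell-out parser.
// Returns the parsed number, or null when the text cannot be parsed.
function parseSpelledOutNumber(string $text, string $locale = 'en_US'): ?float
{
    if (!class_exists('NumberFormatter')) {
        throw new RuntimeException('The intl extension is required.');
    }

    $formatter = new NumberFormatter($locale, NumberFormatter::SPELLOUT);

    // Normalise case and surrounding whitespace before parsing
    $value = $formatter->parse(strtolower(trim($text)));

    // parse() returns false on failure; map that to null for the caller
    return $value === false ? null : (float) $value;
}

echo parseSpelledOutNumber('one million two hundred thirty-four thousand five hundred sixty-seven');
```

Checking for `false` matters here because a failed parse would otherwise be silently printed as an empty string.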

will

The PEAR Numbers_Words package is probably a good start: http://pear.php.net/package-info.php?package=Numbers_Words

Jani Hartikainen
  • Thanks, Jani. This package looks interesting, though it does the reverse of what I need, i.e. it converts numbers to words. It would be useful in future projects. – user132513 Jul 03 '09 at 05:34