3

I have a string containing prices of items. How do I extract all the prices in the text when the currency symbol is not known in advance?

I got a wristwatch for $500 and i could sell it to a Nigerian for ₦13,000 or to someone in Saudi Arabia for ﷼800

How do I get all the prices and their currency symbols?

Thanks

Kennedy
  • 1
    What have you tried? It's a pretty trivial regex, you just need a couple wild card values. – Silas Ray Mar 26 '12 at 21:18
  • 1
    You seem to know you need regex. Have you tried it? Match for any of the allowed currency symbols and any number of numbers directly after them. – keyser Mar 26 '12 at 21:20
  • @keyser5053: Yes, I can do it with regex. But what do I do when I need to match a symbol like the one for the Afghan afghani? I am finding it difficult to copy that one into the editor. – Kennedy Mar 26 '12 at 21:28
  • @Nedy - see my answer; let the Unicode character class handle the currency symbols. – sw1nn Mar 26 '12 at 21:32
  • @sw1nn - Thanks, I am taking a look at it. – Kennedy Mar 26 '12 at 21:42

5 Answers

4

There is a regular expression character class for currency symbols:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

// The class name is arbitrary; the original snippet omitted the class declaration.
public class CurrencyExtractor {

    // (incomplete) list of currency symbols, enhance from http://www.unicode.org/charts/PDF/U20A0.pdf
    private static final String CURRENCY_SYMBOLS = "\\p{Sc}\u0024\u060B";

    public static void main(String[] args) {
        // One currency symbol followed by one or more digits or commas
        Pattern p = Pattern.compile("[" + CURRENCY_SYMBOLS + "][\\d,]+");

        Matcher m = p.matcher("I got a wristwatch for $500 and i could sell it to a Nigerian for " +
                "₦13,000 or to someone in Saudi Arabia for ﷼800 or Afghanistan for ؋350");

        while (m.find()) {
            System.out.println(m.group());
        }
    }
}

//Output is:
// $500
// ₦13,000
// ﷼800
// ؋350
sw1nn
  • I tried it on the above string and it worked. However, it doesn't extract prices with letter-based symbols, e.g. دج , .د.ب etc. – Kennedy Mar 26 '12 at 21:54
  • Yes, that's quite annoying. You will have to enhance the regex with all the currency symbols you need; the chart linked from one of the other answers, http://www.unicode.org/charts/PDF/U20A0.pdf, gives you the list. I'll enhance the code above to illustrate. – sw1nn Mar 26 '12 at 22:12
  • This is great. All I need to do is add more currency symbols. Thank you :D – Kennedy Mar 26 '12 at 22:31
  • I had to match decimals such as $500.50 as well. For that, use `("[" +CURRENCY_SYMBOLS + "][\\d,]+[\\.]+[\\d]+")` – aggregate1166877 Apr 22 '13 at 15:07
  • You're allowing ``123....23``, which you probably don't mean. Having said that, an exhaustive regex that allows ``123,232,324.34`` but rejects a variant like ``123,,232...34`` is quite painful to write. – sw1nn Apr 22 '13 at 16:39
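
For what it's worth, here is one possible sketch of a stricter pattern along the lines discussed in the comments above. It is not from the original answer: the class name `StrictPriceSketch` and the `extract` helper are made up for illustration, and it assumes a comma as the thousands separator, a dot as the decimal mark, and the symbol in front of the amount.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, not part of the original answer.
public class StrictPriceSketch {
    // Currency symbol, then either comma-grouped thousands or plain digits,
    // then an optional decimal part. The trailing negative lookahead rejects
    // a match that is directly followed by more separators and digits.
    private static final Pattern PRICE = Pattern.compile(
            "\\p{Sc}(?:\\d{1,3}(?:,\\d{3})+|\\d+)(?:\\.\\d+)?(?![,.]*\\d)");

    static List<String> extract(String text) {
        List<String> prices = new ArrayList<>();
        Matcher m = PRICE.matcher(text);
        while (m.find()) {
            prices.add(m.group());
        }
        return prices;
    }

    public static void main(String[] args) {
        System.out.println(
                extract("Paid $123,232,324.34 then $500.50 but not $123,,232...34"));
    }
}
```

With that input the sketch should keep `$123,232,324.34` and `$500.50` and skip the malformed `$123,,232...34` entirely. Locales that swap the roles of comma and dot would need a different pattern.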
2

I am currently working on a small function that uses a regex to get the price amount out of a String:

private static String getPrice(String input)
{
    String output = "";

    Pattern pattern = Pattern.compile("\\d{1,3}[,\\.]?(\\d{1,2})?");
    Matcher matcher = pattern.matcher(input);
    if (matcher.find())
    {
        output = matcher.group(0);
    }

    return output;
}

This seems to work with small prices (0,00 to 999,99) and various currencies:

$12.34 -> 12.34

$12,34 -> 12,34

$12.00 -> 12.00

$12 -> 12

12€ -> 12

12,11€ -> 12,11

12.999€ -> 12.99

12.9€ -> 12.9

£999.99€ -> 999.99

...

Tobliug
2

Instead of adding the currency symbols to the string, you could use \u20a6 in the string for Nigerian currency and \ufdfc in the string for Saudi Arabian currency.

Barry Kaye
jocopa3
  • 1
    the \uXXXX notation is effectively a pre-processor directive, processed before compilation, so as far as the compiler is concerned \u20a6 and ₦ are equivalent. – sw1nn Mar 26 '12 at 22:25
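
A minimal sketch of what this answer suggests, assuming only the dollar, naira and rial symbols are of interest. The class name `EscapedSymbolsSketch` and the `extract` helper are illustrative, not from the original post.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, not part of the original answer.
public class EscapedSymbolsSketch {
    // Naira sign and rial sign written as Unicode escapes, plus a literal
    // dollar sign. Inside a character class the dollar sign needs no escaping.
    private static final Pattern PRICE = Pattern.compile("[\u20a6\ufdfc$][\\d,]+");

    static List<String> extract(String text) {
        List<String> prices = new ArrayList<>();
        Matcher m = PRICE.matcher(text);
        while (m.find()) {
            prices.add(m.group());
        }
        return prices;
    }

    public static void main(String[] args) {
        System.out.println(extract("I got a wristwatch for $500 and i could sell it "
                + "to a Nigerian for ₦13,000 or to someone in Saudi Arabia for ﷼800"));
    }
}
```

This should find `$500`, `₦13,000` and `﷼800`. As the comment above points out, the escaped and literal forms compile to the same string, so the escapes mainly help when the symbol is hard to type or paste into the editor.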
1

For the string above, you can simply split on spaces first and then keep the tokens that contain a digit.

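    // Compile the pattern once, outside the loop
    Pattern digit = Pattern.compile("[0-9]");
    List<String> result = new ArrayList<String>();
    for (String s : givenString.split(" ")) {
        // Keep only the tokens that contain at least one digit
        if (digit.matcher(s).find()) {
            result.add(s);
        }
    }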
anvarik
1

Java has a syntax for writing any Unicode character it can handle; it looks like '\uffff'.

Unicode symbols are quite carefully defined so that related groups can be found. This says it's a list of all Unicode currency symbols.

Armed with those Unicode symbols in a regex, you could find money anywhere :-)

The Oracle (née Sun) documentation on regular expressions has a whole set of character classes which include currency.

I do not know which version of Unicode is actually implemented. The reference I found at Oracle was "The supported blocks and categories are those of The Unicode Standard, Version 3.0", which according to the Unicode group dates from September 1999, so that is what I'd assume.

This does include GBP £ and Euro € so I am okay :-) but it might not be up to date, though humanity doesn't invent currencies too often.

It would be a bit tedious, but you could generate a string for every character code (one at a time), test each for a match against the regex currency class, and check that the ones you particularly care about are included.
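
The check described above could be sketched roughly as follows. The class name `CurrencySymbolLister` and the `isCurrencySymbol` helper are made up for illustration, and the exact set printed depends on the Unicode version your Java runtime implements.

```java
import java.util.regex.Pattern;

// Hypothetical helper class, not part of the original answer.
public class CurrencySymbolLister {
    private static final Pattern CURRENCY = Pattern.compile("\\p{Sc}");

    // True if the regex currency class matches this single code point.
    static boolean isCurrencySymbol(int codePoint) {
        String s = new String(Character.toChars(codePoint));
        return CURRENCY.matcher(s).matches();
    }

    public static void main(String[] args) {
        StringBuilder found = new StringBuilder();
        // Walk every Unicode code point, one at a time.
        for (int cp = Character.MIN_CODE_POINT; cp <= Character.MAX_CODE_POINT; cp++) {
            if (isCurrencySymbol(cp)) {
                found.appendCodePoint(cp).append(' ');
            }
        }
        System.out.println(found);

        // Spot-check the symbols you particularly care about.
        System.out.println("pound: " + isCurrencySymbol(0x00A3));
        System.out.println("euro:  " + isCurrencySymbol(0x20AC));
    }
}
```

The loop covers a little over a million code points, so it finishes quickly, and the printed list shows at a glance which symbols the runtime treats as currency.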

There is actually a further problem. Different countries use different marks for the decimal point, and some countries put the symbol after the amount. So far I haven't found a great solution to that; this question (http://stackoverflow.com/questions/9185793/how-do-i-get-the-currency-symbol-of-a-currency-as-it-would-appear-in-one-of-its) has no good answer.

So you might need to look for a number on either side of the currency symbol.
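
As a rough sketch of that idea, the pattern below accepts the symbol either before or after the number and treats both '.' and ',' as possible separators. The class name `EitherSidePriceSketch`, the group names and the `extract` helper are all illustrative assumptions, not an established solution.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, not part of the original answer.
public class EitherSidePriceSketch {
    // A number: runs of digits, optionally joined by '.' or ','.
    private static final String NUM = "\\d+(?:[.,]\\d+)*";

    // Either symbol-then-number or number-then-symbol, with an optional space.
    private static final Pattern PRICE = Pattern.compile(
            "(?<symBefore>\\p{Sc})\\s?(?<numAfter>" + NUM + ")"
            + "|(?<numBefore>" + NUM + ")\\s?(?<symAfter>\\p{Sc})");

    // Returns each price as "symbol amount".
    static List<String> extract(String text) {
        List<String> found = new ArrayList<>();
        Matcher m = PRICE.matcher(text);
        while (m.find()) {
            boolean prefix = m.group("symBefore") != null;
            String symbol = prefix ? m.group("symBefore") : m.group("symAfter");
            String amount = prefix ? m.group("numAfter") : m.group("numBefore");
            found.add(symbol + " " + amount);
        }
        return found;
    }

    public static void main(String[] args) {
        System.out.println(extract("$12.34 here, 12,11€ there, ₦13,000 elsewhere"));
    }
}
```

For that input the sketch should report `$ 12.34`, `€ 12,11` and `₦ 13,000`. It does not try to decide whether a separator is a decimal mark or a thousands separator, and when a number has a symbol on both sides it simply takes the one in front.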

gbulmer