How do I write a regex that (1) matches a currency amount, which may or may not include commas or a decimal part, and (2) matches only the currency code? Most examples I have found match currency symbols. I want to extract the amount ['300,000.00'] and the currency code ['USD'] from a complete text such as this:

Userid 9XXXX219 sales USD300,000.00 On 01-JUL-2016 08:34:32

So far I have tried this, but it matches only the amounts with a decimal part, not the ones without a decimal part or the ones with a comma:

import re

s = 'USD1 USD1.00 USD100.00 USD1,000 CAD1,000.00'
re.findall(r'\d+\.\d+', s)
# matches
['1.00', '100.00', '000.00']

# it should not match anything else, e.g. nothing from 1XXXX324

# expected result instead:
['1','1.00', '100.00', '1,000', '1,000.00']

And how do I write another regex pattern that matches ONLY the currency codes? i.e.

['USD', 'USD', 'USD', 'USD','CAD'] 
DougKruger

3 Answers

Get the Currency:

Maintaining an exhaustive list of valid currencies might not be feasible, but if you only need a limited number of currencies, you can list them in an alternation like this:

re.findall('USD|CAD','USD1 USD1.00 USD100.00 USD1,000 CAD1,000.00 123XXX123')

Output:

['USD', 'USD', 'USD', 'USD', 'CAD']
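
If listing every code is not practical, one possible alternative is to match any three uppercase letters that are directly followed by a digit. This is only a sketch: the name CODE_RE and the rule "three uppercase letters at a word boundary, followed by a digit" are assumptions for illustration, and the pattern does not check that the code is a real currency.

```python
import re

# Assumption: a currency code is exactly three uppercase letters that start
# at a word boundary and are immediately followed by a digit.
# The lookahead (?=\d) checks for the digit without consuming it.
CODE_RE = re.compile(r'\b[A-Z]{3}(?=\d)')

text = 'Userid 9XXXX219 sales USD300,000.00 On 01-JUL-2016 08:34:32'
s = 'USD1 USD1.00 USD100.00 USD1,000 CAD1,000.00 123XXX123'

codes_in_text = CODE_RE.findall(text)
codes_in_s = CODE_RE.findall(s)

print(codes_in_text)  # ['USD']
print(codes_in_s)     # ['USD', 'USD', 'USD', 'USD', 'CAD']
```

The word boundary keeps the XXX inside 9XXXX219 and 123XXX123 from matching, and the lookahead rejects JUL because it is followed by a hyphen.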

Get the amount:

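If the pattern contains capturing groups, re.findall returns the captured groups (as tuples when there is more than one group) rather than the whole match. Using non-capturing groups (?:...) avoids this. The lookbehind (?<=USD|CAD) requires a currency code directly before the amount, and the lookahead (?=\s) requires whitespace directly after it.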

re.findall(r'(?<=USD|CAD)\d{1,3}(?:,\d{3})*(?:\.\d+)?(?=\s)','Userid 9XXXX219 sales USD300,000.00 On 01-JUL-2016 08:34:32')

Output:

['300,000.00']
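
To make the difference between capturing and non-capturing groups concrete, here is a small sketch that runs the same amount pattern both ways. The variable names are made up for the illustration.

```python
import re

sample = 'USD1,000.00'

# With capturing groups, findall returns one tuple per match that holds
# only the text captured by each group, not the whole amount.
with_capture = re.findall(r'\d{1,3}(,\d{3})*(\.\d+)?', sample)

# With non-capturing groups, findall returns the whole match as a string.
without_capture = re.findall(r'\d{1,3}(?:,\d{3})*(?:\.\d+)?', sample)

print(with_capture)     # [(',000', '.00')]
print(without_capture)  # ['1,000.00']
```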

Illustration with the example text:

re.findall(r'(?<=USD|CAD)\d{1,3}(?:,\d{3})*(?:\.\d+)?(?=\s)','USD1 USD1.00 USD100.00 USD1,000 CAD1,000.00 123XXX123')

Output:

['1', '1.00', '100.00', '1,000', '1,000.00']
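
As a possible variation, the code and the amount can be pulled out in one pass with named groups and then split into two lists. This is a sketch under assumptions: PAIR_RE and the group names code and amount are hypothetical, a code is taken to be any three uppercase letters, and the amount is either comma-grouped digits or a plain run of digits.

```python
import re

# Assumptions: three uppercase letters at a word boundary form the code;
# the amount is comma-grouped (1,000) or plain (1000) digits with an
# optional decimal part, and it must end at a word boundary.
PAIR_RE = re.compile(
    r'\b(?P<code>[A-Z]{3})'
    r'(?P<amount>(?:\d{1,3}(?:,\d{3})+|\d+)(?:\.\d+)?)\b'
)

s = 'USD1 USD1.00 USD100.00 USD1,000 CAD1,000.00 123XXX123'

matches = list(PAIR_RE.finditer(s))
codes = [m.group('code') for m in matches]
amounts = [m.group('amount') for m in matches]

print(codes)    # ['USD', 'USD', 'USD', 'USD', 'CAD']
print(amounts)  # ['1', '1.00', '100.00', '1,000', '1,000.00']
```

Unlike the (?=\s) lookahead, the trailing \b also accepts an amount at the very end of the string.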

Read more about the following constructs:

(?=...) - positive lookahead
(?<=...) - positive lookbehind

gaganso
  • Thank you, but rather than having both patterns together, is it possible to split them into two different patterns, where one matches the currency code and the second matches the amount? I noticed that without `(?:USD|CAD)` it matches `1XXXX324` as `['1', '324']` – DougKruger Jul 01 '16 at 11:47
  • @KrugerBr, yes. That can be done. As you have tried, split the regex. Can you edit the question and post the actual text from which you want to get these results? Because finding an ideal regex that satisfies all the conditions is too complex. – gaganso Jul 01 '16 at 11:50
  • @KrugerBr, please check and let me know if this is working. Also, if this answer has helped, upvote it. If it solves your problem, click on the tick mark below the upvote/downvote arrows to accept the answer. – gaganso Jul 01 '16 at 12:02
  • the example in my question `'Userid 9XXXX219 sales USD300,000.00 On 01-JUL-2016 08:34:32'` matches `['219', '300,000.00', '016']` rather than just `['300,000.00']` – DougKruger Jul 01 '16 at 12:07
  • I updated the example to `'Userid 9XXXX219 sales USD300,000.00 On 01-JUL-2016 08:34:32'` – DougKruger Jul 01 '16 at 12:39
  • Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/116196/discussion-between-krugerbr-and-silentmonk). – DougKruger Jul 01 '16 at 12:56
'\d+([.,]?\d*)*' should match all of the amount cases. If you also want to allow a space as a separator, add it to the character class, like this:

'\d+([., ]?\d*)*' 

For the currency codes: '[A-Z]{3}' should work.

P.S. Following SilentMonk's suggestion to use non-capturing groups:

(?:[A-Z]{3})(?:\d+(?:[.,]?\d*)*)
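
A quick check of the combined pattern on the text from the question shows that, as written, it also picks up letters and digits from inside the user id. Adding a word boundary in front is one way to avoid that. This is a sketch only, and the names LOOSE_RE and BOUNDED_RE are made up for the illustration.

```python
import re

text = 'Userid 9XXXX219 sales USD300,000.00 On 01-JUL-2016 08:34:32'

# The combined pattern as given: any three uppercase letters followed by digits.
LOOSE_RE = re.compile(r'(?:[A-Z]{3})(?:\d+(?:[.,]?\d*)*)')

# Same pattern with a leading word boundary (an added assumption), so the
# three letters cannot start in the middle of a token such as 9XXXX219.
BOUNDED_RE = re.compile(r'\b(?:[A-Z]{3})(?:\d+(?:[.,]?\d*)*)')

loose = LOOSE_RE.findall(text)
bounded = BOUNDED_RE.findall(text)

print(loose)    # ['XXX219', 'USD300,000.00']
print(bounded)  # ['USD300,000.00']
```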
Maria Ivanova
  • Nice one. But wouldn't a (USD | CAD .....) be a better regex for currencies since `ABC123` will also get matched here ? – gaganso Jul 01 '16 at 11:14
  • Only if you are sure that the OP needs only these 2 currencies. Considering the list of currencies globally, and that it might grow every day, I wouldn't be surprised if tomorrow there is a valid currency ABC. – Maria Ivanova Jul 01 '16 at 11:16
  • Yeah, let's see what the OP requires. – gaganso Jul 01 '16 at 11:17
  • Your RE also matches 1.00.00.00 – Devi Prasad Khatua Jul 01 '16 at 11:17
  • Depending on the locale, it might be a valid number (1 million). Although, usually the digits are separated by 3. Still, this is the simplest regex that would work for the OP. We can improve it and complexify it as much as we want. :) I would think of a better option. – Maria Ivanova Jul 01 '16 at 11:20
  • `re.findall(r'(\d+([., ]?\d*)*)', s)` `[('1 ', ''), ('1.00 ', ''), ('100.00 ', ''), ('1,000 ', ''), ('1,000.00', '')]` how to return as list instead of tuples? – DougKruger Jul 01 '16 at 11:23
  • @KrugerBr, you need to use non-capturing groups. See my answer. – gaganso Jul 01 '16 at 11:33
To match the currency amount only, you can use: (\d[0-9,.]+)

and to match the currency codes, you can use: ([A-Z]+)

Demo and Explanation

Shekhar Khairnar