7

I need a regex to match numeric values that can be in any of these formats:

111.111,11

111,111.11

111,111

It should also separate the integer and decimal portions so I can store the value in a DB with the correct syntax.

I tried ([0-9]{1,3}[,.]?)+([,.][0-9]{2})? with no success, since it doesn't detect the second part :(
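
A quick way to see why the second group stays empty, sketched here in Python (the variable names are illustrative and not from the question): the greedy `+` on the first group can take the trailing digits as one more repetition without a separator, so nothing is left for the optional second group.

```python
import re

# The pattern from the attempt above, unchanged.
attempt = re.compile(r'([0-9]{1,3}[,.]?)+([,.][0-9]{2})?')

m = attempt.match('111.111,11')
# The repeated first group consumes "111.", "111," and finally "11",
# so group 1 keeps only its last repetition and group 2 never participates.
print(m.group(0))   # the whole match
print(m.groups())   # the captured groups
```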

The result should look like:

111.111,11 -> $1 = 111111; $2 = 11
nhahtdh
LuRsT
  • just out of curiosity, why would you ever have a pattern such as: 11.111,111, that is the reverse of the actual value (111,111.11) – ennuikiller Aug 18 '09 at 17:33
  • Just to make this idiot proof. So that users don't have to remember what's the right pattern – LuRsT Aug 18 '09 at 17:35
  • That is actually quite smart, as there are many countries in the world using the comma as a decimal separator. For a list, check here: http://en.wikipedia.org/wiki/Decimal_separator#Countries_using_Arabic_numerals_with_decimal_comma – Håkon Aug 18 '09 at 18:36
  • What do you do with 111,111 or is that not allowed? – derobert Aug 18 '09 at 22:17
  • 111,111 = 111111. So it has no decimals :) – LuRsT Aug 19 '09 at 09:52
  • You could also interpret 111,111 as 111.111, so you would have to decide how to handle edge-cases. – Håkon Aug 19 '09 at 10:22
  • Yes, that's why the last part at least has to end with 2 digits, or nothing, @jpbochi final answer handles that. – LuRsT Aug 19 '09 at 10:53

5 Answers

11

First Answer:

This matches #,###,##0.00:

^[+-]?[0-9]{1,3}(?:\,?[0-9]{3})*(?:\.[0-9]{2})?$

And this matches #.###.##0,00:

^[+-]?[0-9]{1,3}(?:\.?[0-9]{3})*(?:\,[0-9]{2})?$

Joining the two (there are smarter/shorter ways to write it, but it works):

(?:^[+-]?[0-9]{1,3}(?:\,?[0-9]{3})*(?:\.[0-9]{2})?$)
|(?:^[+-]?[0-9]{1,3}(?:\.?[0-9]{3})*(?:\,[0-9]{2})?$)

You can also add a capturing group around the last comma (or dot) to check which one was used.
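
As a rough sketch of how the joined pattern could be used, here are the two expressions in Python, tried one after the other. The names `COMMA_THOUSANDS`, `DOT_THOUSANDS` and `is_number` are made up for this illustration.

```python
import re

# The two patterns from above; the commas need no escaping in Python.
COMMA_THOUSANDS = re.compile(r'^[+-]?[0-9]{1,3}(?:,?[0-9]{3})*(?:\.[0-9]{2})?$')
DOT_THOUSANDS = re.compile(r'^[+-]?[0-9]{1,3}(?:\.?[0-9]{3})*(?:,[0-9]{2})?$')

def is_number(text):
    """Return True if text matches either of the two formats."""
    return bool(COMMA_THOUSANDS.match(text) or DOT_THOUSANDS.match(text))

for sample in ['111.111,11', '111,111.11', '111,111', '12,3']:
    print(sample, is_number(sample))
```

Because the thousands separator is optional in each repetition, a value such as 11,111111.00 is still accepted by this version.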


Second Answer:

As pointed out by Alan M, my previous solution could fail to reject a value like 11,111111.00, where one comma is missing but the other isn't. After some tests I reached the following regex, which avoids this problem:

^[+-]?[0-9]{1,3}
(?:(?<comma>\,?)[0-9]{3})?
(?:\k<comma>[0-9]{3})*
(?:\.[0-9]{2})?$

This deserves some explanation:

  • ^[+-]?[0-9]{1,3} matches the first (1 to 3) digits;

  • (?:(?<comma>\,?)[0-9]{3})? matches an optional comma followed by 3 more digits, and captures the comma (or the absence of one) in a group called 'comma';

  • (?:\k<comma>[0-9]{3})* matches zero or more repetitions of the comma used before (if any) followed by 3 digits;

  • (?:\.[0-9]{2})?$ matches optional "cents" at the end of the string.

Of course, that will only cover #,###,##0.00 (not #.###.##0,00), but you can always join the regexes like I did above.
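
For readers on Python, a hedged translation of the regex above: Python writes the named group as `(?P<comma>...)` and the backreference as `(?P=comma)`. The constant name `CONSISTENT_COMMAS` is only illustrative.

```python
import re

# Same structure as the regex above, in Python's named-group syntax.
CONSISTENT_COMMAS = re.compile(
    r'^[+-]?[0-9]{1,3}'            # sign and the first 1 to 3 digits
    r'(?:(?P<comma>,?)[0-9]{3})?'  # first group of 3, remembering the comma (or its absence)
    r'(?:(?P=comma)[0-9]{3})*'     # later groups must repeat the same choice
    r'(?:\.[0-9]{2})?$'            # optional "cents"
)

for sample in ['1,111,111.00', '1111111.00', '11,111111.00', '1111,111.00']:
    print(sample, bool(CONSISTENT_COMMAS.match(sample)))
```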


Final Answer:

Now, a complete solution. Indentations and line breaks are there for readability only.

^[+-]?[0-9]{1,3}
(?:
    (?:\,[0-9]{3})*
    (?:\.[0-9]{2})?
|
    (?:\.[0-9]{3})*
    (?:\,[0-9]{2})?
|
    [0-9]*
    (?:[\.\,][0-9]{2})?
)$

And this variation captures the separators used:

^[+-]?[0-9]{1,3}
(?:
    (?:(?<thousand>\,)[0-9]{3})*
    (?:(?<decimal>\.)[0-9]{2})?
|
    (?:(?<thousand>\.)[0-9]{3})*
    (?:(?<decimal>\,)[0-9]{2})?
|
    [0-9]*
    (?:(?<decimal>[\.\,])[0-9]{2})?
)$
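
Python's `re` module does not accept a group name that is defined more than once, so the sketch below uses numbered groups, as suggested in the comments: groups 1 and 3 hold the thousands separator, groups 2, 4 and 5 the decimal separator. The function name `split_number` and the choice to return a pair of strings are assumptions made for this example, not part of the answer.

```python
import re

# Verbose-mode version of the pattern above with numbered capturing groups.
NUMBER = re.compile(r"""
    ^[+-]?[0-9]{1,3}
    (?:
        (?:(,)[0-9]{3})*       # group 1: comma as thousands separator
        (?:(\.)[0-9]{2})?      # group 2: dot as decimal separator
    |
        (?:(\.)[0-9]{3})*      # group 3: dot as thousands separator
        (?:(,)[0-9]{2})?       # group 4: comma as decimal separator
    |
        [0-9]*
        (?:([.,])[0-9]{2})?    # group 5: decimal separator, no thousands separator
    )$
""", re.VERBOSE)

def split_number(text):
    """Return (integer_part, decimal_part) as strings, or None if text is invalid."""
    m = NUMBER.match(text)
    if m is None:
        return None
    thousands = m.group(1) or m.group(3)
    decimal = m.group(2) or m.group(4) or m.group(5)
    if thousands:
        text = text.replace(thousands, '')   # drop every thousands separator
    if decimal:
        integer, fraction = text.split(decimal)
    else:
        integer, fraction = text, ''         # plain integer, no decimals
    return integer, fraction

print(split_number('111.111,11'))
print(split_number('111,111'))
```

The sign, if present, stays attached to the integer part.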

edit 1: "cents" are now optional; edit 2: text added; edit 3: second solution added; edit 4: complete solution added; edit 5: headings added; edit 6: capturing added; edit 7: last answer broke in two versions;

Mofi
jpbochi
  • +1. I would move the anchors outside the alternation. You could move the common leading and trailing elements outside it as well, but that's not necessarily worth the tradeoff in readability – Alan Moore Aug 18 '09 at 20:59
  • Readability is not a strong point of regular expressions, but I agree. Thanks for the vote :) – jpbochi Aug 18 '09 at 21:47
  • Just noticed, the thousands separators should *not* be optional; e.g., `(?:\.?[0-9]{3})*` should be `(?:\.[0-9]{3})*`. Otherwise, you could match things like `11,111111.00` or `1111.111,00`. – Alan Moore Aug 18 '09 at 23:36
  • Ok, but what if you want them to be optional? – jpbochi Aug 19 '09 at 02:59
  • Now, it's optional and doesn't have the problem you pointed. :) – jpbochi Aug 19 '09 at 03:25
  • Very nice! I wasn't even thinking about handling numbers with no thousands separators (since it's not in the question), but that's downright elegant. – Alan Moore Aug 19 '09 at 05:58
  • Oh, you meant **capturing**! I couldn't figure out what you (@LuRsT) meant by *backtracking*, but now I see you meant **capturing** all along. And again (@jpbochi), nicely done! You capture each separator (if there is one) in its own named group, so later you can remove all the thousands separators and split on the decimal separator. Unfortunately, it will only work in the **.NET**, **JGSoft**, or **Perl 5.10+** regex flavors; as of now, no others permit group names to be reused within a regex (which is a damn shame--that's a killer feature). – Alan Moore Aug 19 '09 at 13:12
  • Wow, great regex, but yes, I can't use it (I'm using php 5.3) can you make a version for that? Even If I have to search through the results to find the groups correctly :) – LuRsT Aug 19 '09 at 14:28
  • Also, backtrack in your final answer doesn't work for the first numbers :( – LuRsT Aug 19 '09 at 15:20
  • I didn't get it. Which first numbers are you talking about? – jpbochi Aug 19 '09 at 16:05
  • From the first line, or are they in the thousand group? – LuRsT Aug 19 '09 at 17:26
  • The 'thousand' group will only capture the separator character (`','` or `'.'`). I realized later that you wanted to capture the numbers themselves. I'm not sure it's possible with a raw regex. You may use the regex that I wrote to validate the string and capture the separators. Then, in a second step, you may split the digits and remove the separators. – jpbochi Aug 19 '09 at 19:26
  • You can replace the named groups with old-fashioned numbered groups, eg, `(\,)` instead of `(?<thousand>\,)`. Then, if group 1 or 3 matched anything, that's your thousands separator (TS); if group 2, 4 or 5 matched, that's the decimal separator (DS). Delete all the TS's and split on the DS, and Bob's your uncle. (If none of them participated in the match, the number's an integer--no post-processing required.) – Alan Moore Aug 20 '09 at 03:50
3

I would first use this regex to determine whether a comma or a dot is used as the decimal delimiter (it captures the last of the two):

[0-9,\.]*([,\.])[0-9]*

I would then strip every occurrence of the other sign (the one the previous regex didn't capture). If there was no match, you already have an integer and can skip the next steps. Removing that sign can easily be done with a regex, but there are also many other functions that can do it faster/better.

You are then left with an integer, possibly followed by a comma or a dot and then the decimals. The integer and decimal parts can easily be separated from each other with the following regex.

([0-9]+)[,\.]?([0-9]*)

Good luck!

Edit:

Here is an example written in Python. I assume the code is self-explanatory; if it is not, just ask.

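import re

value = input()
# Group 1 captures the last separator in the string, assumed to be the decimal one.
delimiterRegex = re.compile(r'[0-9,.]*([,.])[0-9]*')
# Splits the cleaned-up string into integer and decimal parts (the next step).
splitRegex = re.compile(r'([0-9]+)[,.]?([0-9]*)')

delimiter = delimiterRegex.findall(value)

# Strip the other separator; an empty result means there was no separator at all.
if delimiter and delimiter[0] == ',':
    value = re.sub(r'\.', '', value)
elif delimiter and delimiter[0] == '.':
    value = re.sub(',', '', value)

print(value)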

With this code, the following inputs give these outputs:

  • 111.111,11

    111111,11

  • 111,111.11

    111111.11

  • 111,111

    111,111

After this step, one can now easily modify the string to match your needs.
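
A minimal sketch of that last step, assuming the string has already been normalized as above. The helper name `split_parts` and the use of `fullmatch` are choices made for this illustration.

```python
import re

# The second regex from this answer, written as a raw string.
split_regex = re.compile(r'([0-9]+)[,.]?([0-9]*)')

def split_parts(normalized):
    """Split a normalized number into (integer_part, decimal_part) strings."""
    m = split_regex.fullmatch(normalized)
    if m is None:
        return None
    return m.group(1), m.group(2)

for sample in ['111111,11', '111111.11', '111111', '111,111']:
    print(sample, split_parts(sample))
```

Note that 111,111 comes out as integer part 111 and decimal part 111 here, which is the edge case raised in the comments under the question.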

Håkon
  • I'm pretty sure this answer is wrong, but I can't say for certain because you don't really say how you're using the regexes (but that's sufficient reason for a downvote right there). Can you explain how you're distinguishing the thousands separator from the decimal separator (with tested examples)? – Alan Moore Aug 18 '09 at 22:49
  • The first regex will determine what is the decimal separator by finding which of them that occurs last. You then strip the number of the other operator. And you will be left with a number without thousand separators. The rest should be piece of cake. Will post example-code later. – Håkon Aug 19 '09 at 10:20
  • According to the OP, the comma in `111,111` is a thousands separator (TS). A decimal separator (DS), if present, must be followed by exactly two digits (he cleared that up in the comments under the question). So your first regex would have to end with `([,.][0-9]{2})?` like the OP's did. But he's also trying to validate that the TS's are correctly distributed. – Alan Moore Aug 20 '09 at 12:57
1

How about

/(\d{1,3}(?:,\d{3})*)(\.\d{2})?/

if you care about validating that the commas separate every 3 digits exactly, or

/(\d[\d,]*)(\.\d{2})?/

if you don't.
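
To show how the two capturing groups of the stricter pattern might be post-processed, here is a small Python sketch. It assumes the whole string must be a number (hence `fullmatch`, which the unanchored pattern above does not require), and `split_strict` is a hypothetical helper name.

```python
import re

# The stricter pattern from above: commas must separate groups of exactly 3 digits.
STRICT = re.compile(r'(\d{1,3}(?:,\d{3})*)(\.\d{2})?')

def split_strict(text):
    """Return (integer_part, decimal_part) without separators, or None."""
    m = STRICT.fullmatch(text)
    if m is None:
        return None
    integer = m.group(1).replace(',', '')        # drop the thousands commas
    decimals = (m.group(2) or '').lstrip('.')    # group 2 is None without decimals
    return integer, decimals

print(split_strict('111,111.11'))
print(split_strict('111,111'))
```

This only covers the 111,111.11 style; input such as 111.111,11 is rejected.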

Avi
0

If I'm interpreting your question correctly, and the result SHOULD look like what you describe, then I think you just need to leave the comma out of the character class, since it is used as a separator and is not part of what is to be matched.

So get rid of the "." first, then match the two parts.

$value = "111.111,11";
$value =~ s/\.//g;
$value =~ m/(\d+)(?:,(\d+))?/;

$1 = leading integers with periods removed; $2 = either undef if there were no post-comma digits, or those digits if they exist.
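
The same two steps in Python, as a sketch that assumes the input uses dots as thousands separators and a comma as the decimal separator. The function name `split_dot_thousands` is invented for this example.

```python
import re

def split_dot_thousands(value):
    """Strip the dots, then capture the digits before and after the comma."""
    value = value.replace('.', '')               # same effect as s/\.//g
    m = re.match(r'(\d+)(?:,(\d+))?', value)
    if m is None:
        return None
    # group 2 is None when there are no post-comma digits
    return m.group(1), m.group(2)

print(split_dot_thousands('111.111,11'))
print(split_dot_thousands('111.111'))
```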

Alan Moore
Devin Ceartas
0

See Perl's Regexp::Common::number.

Sinan Ünür