6

I am writing lexer rules for a custom description language using pyLR1. The language shall include time literals such as:

10h30m     # meaning 10 hours + 30 minutes
5m30s      # meaning 5 minutes + 30 seconds
10h20m15s  # meaning 10 hours + 20 minutes + 15 seconds
15.6s      # meaning 15.6 seconds

The order of the hour, minute and second parts shall be fixed to h, m, s. In detail, I want the following valid combinations: hms, hm, h, ms, m and s (with a number before each unit, of course). As a bonus, the regex should allow decimal (i.e. non-integer) numbers only in the least significant segment.

So I have for all but the last group a number match like:

([0-9]+)

And for the last group, where decimals are allowed:

([0-9]*\.[0-9]+|[0-9]+(\.[0-9]*)?)  # to allow for .5 and 0.5 and 5.0 and 5

Going through all the combinations of h, m and s, a cute little Python script gives me the following regex:

(([0-9]*\.[0-9]+|[0-9]+(\.[0-9]*)?)h|([0-9]+)h([0-9]*\.[0-9]+|[0-9]+(\.[0-9]*)?)m|([0-9]+)h([0-9]+)m([0-9]*\.[0-9]+|[0-9]+(\.[0-9]*)?)s|([0-9]*\.[0-9]+|[0-9]+(\.[0-9]*)?)m|([0-9]+)m([0-9]*\.[0-9]+|[0-9]+(\.[0-9]*)?)s|([0-9]*\.[0-9]+|[0-9]+(\.[0-9]*)?)s) 
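The script itself is not shown in the question. A minimal sketch of how such an expression could be generated follows; the names INT, DEC and build_time_regex are illustrative assumptions, not the asker's actual code.

```python
import re

# Integer segment and "integer or decimal" segment, as given in the question.
INT = r"([0-9]+)"
DEC = r"([0-9]*\.[0-9]+|[0-9]+(\.[0-9]*)?)"


def build_time_regex(combinations=("h", "hm", "hms", "m", "ms", "s")):
    """Build one alternative per unit combination; only the last unit gets DEC."""
    alternatives = []
    for units in combinations:
        # Every unit except the last is preceded by an integer.
        leading = "".join(INT + unit for unit in units[:-1])
        alternatives.append(leading + DEC + units[-1])
    return "(" + "|".join(alternatives) + ")"


TIME_RE = re.compile(build_time_regex())
```

With this sketch, TIME_RE.fullmatch("10h20m15s") succeeds, while TIME_RE.fullmatch("10h30s") returns None because hs is not among the combinations.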

Obviously, this is a bit of a horror expression. Is there any way to simplify it? The answer must work with Python's re module, and I will also accept answers which do not work with pyLR1 if that's due to its restricted subset of regular expressions.

Brian Tompsett - 汤莱恩
Jonas Schäfer

5 Answers

3

You can factorise your regular expression. Using the notation h, m, s to denote each of the subregexes, the most basic version is:

h|hm|hms|ms|m|s

which is what you have currently. You can break this into:

(h|hm|hms)|(ms|m)|s

and then, pulling out h from the first expression and m from the second, we get (using (x|) == x?):

h(m|ms)?|ms?|s

Continuing on we get to

h(ms?)?|ms?|s

which is probably simpler (and probably the simplest).
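As a quick sanity check of the factorisation, the flat and factored forms can be compared on every sequence of up to three segments. This is only a sketch: H, M and S are placeholder sub-regexes chosen for illustration, and accepted is a hypothetical helper.

```python
import itertools
import re

# Placeholder sub-regexes; any patterns for the three segments would do.
H, M, S = r"[0-9]+h", r"[0-9]+m", r"[0-9]+s"

# h|hm|hms|ms|m|s versus h(ms?)?|ms?|s
FLAT = re.compile(f"{H}|{H}{M}|{H}{M}{S}|{M}{S}|{M}|{S}")
FACTORED = re.compile(f"{H}(?:{M}(?:{S})?)?|{M}(?:{S})?|{S}")


def accepted(pattern):
    """Return every sequence of zero to three segments the pattern fully matches."""
    segments = ["1h", "2m", "3s"]
    result = set()
    for length in range(4):
        for combo in itertools.product(segments, repeat=length):
            text = "".join(combo)
            if pattern.fullmatch(text):
                result.add(text)
    return result
```

Both accepted(FLAT) and accepted(FACTORED) come out as the same six strings, one per valid combination.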


Adding in the regex d to denote decimals (as in \.[0-9]+), this could be written as

h(d|m(d|sd?)?)?|m(d|sd?)?|sd?

(i.e. at each stage optionally have either decimals, or a continuation to the next of h m or s.)

This would result in something like (for just hours and minutes):

[0-9]+((\.[0-9]+)?h|h[0-9]+(\.[0-9]+)?m)|[0-9]+(\.[0-9]+)?m

Looking at this, it might not be possible to get it into a form amenable to pyLR1, so parsing with decimals allowed in every spot and then doing a secondary check might be the best way to do this.
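A minimal sketch of that two-step approach in plain Python follows. It assumes the number syntax from the question and uses hypothetical names (NUM, SHAPE, SEGMENT, parse_time_literal); how the second step would hook into pyLR1 is not covered here.

```python
import re

# Assumed number syntax: integer or decimal, as in the question.
NUM = r"(?:[0-9]*\.[0-9]+|[0-9]+(?:\.[0-9]*)?)"

# First pass: decimals allowed in every segment, shape h(ms?)?|ms?|s.
SHAPE = re.compile(rf"{NUM}h(?:{NUM}m(?:{NUM}s)?)?|{NUM}m(?:{NUM}s)?|{NUM}s")
SEGMENT = re.compile(rf"({NUM})([hms])")


def parse_time_literal(text):
    """Return [(value, unit), ...] or raise ValueError for an invalid literal."""
    if not SHAPE.fullmatch(text):
        raise ValueError(f"not a time literal: {text!r}")
    segments = SEGMENT.findall(text)  # list of (number, unit) string pairs
    # Secondary check: only the least significant segment may be a decimal.
    for number, unit in segments[:-1]:
        if not number.isdigit():
            raise ValueError(f"decimal not allowed in {unit!r} segment: {text!r}")
    return [(float(number), unit) for number, unit in segments]
```

For example, parse_time_literal("10h20m15.5s") yields three (value, unit) pairs, while "10.5h20m" passes the first regex but is rejected by the secondary check.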

huon
  • 1
    This does not cover the point with allowing decimal values (only) in the least significant segment, or am I wrong? I could, however, implement that in a later check (i.e. having matched the RE, I can try to cast the non-least-significant segments to integer). – Jonas Schäfer Jul 02 '12 at 12:07
  • 3
    Since there's the desire to distinguish the least significant value specially, I'd think about `(h?m)?S|h?M|H` (where the capitalized version is the one that allows decimals). – Donal Fellows Jul 02 '12 at 12:23
  • @DonalFellows, that's a much better way: factorise from the right instead of from the left. :) – huon Jul 02 '12 at 12:24
  • (Also, @JonasWielicki, I've added in a possible solution for the decimals, it's not very nice though...) – huon Jul 02 '12 at 12:26
  • +1 by me, I like the factorisation logic of both dbaupp and @DonalFellows – c00kiemon5ter Jul 02 '12 at 12:35
  • 1
    @DonalFellows: Your solution seems to work pretty nice. I'd like to give you credit for this one, but I'm afraid right now I cannot do that except by upvoting your comment (is there a way to collaborate on an answer credit-wise on SO?) ;). It aborts parsing after the first decimal, which is not perfect, but it is fine for me. – Jonas Schäfer Jul 02 '12 at 12:37
  • 2
    @Jonas I don't need the rep; it was just an observation based on dbaupp's answer. Myself, I'd actually try to avoid using REs for anything other than the first level of parse, leaving higher-level constraints until later in the tokenizer. – Donal Fellows Jul 03 '12 at 13:12
1

the representation below should be understandable. I don't know the exact regex syntax you're using, so you have to "translate" it to the valid syntax yourself.

your hours

 [0-9]{1,2}h

your minutes

[0-9]{1,2}m

your seconds

[0-9]{1,2}(\.[0-9]{1,3})?s

you want all those in order, and able to omit any of those (wrap with ?)

([0-9]{1,2}h)?([0-9]{1,2}m)?([0-9]{1,2}(\.[0-9]{1,3})?s)?

this, however, also matches things like 10h30s;
that is, the valid combinations become hms, hm, hs, h, ms, m and s.
in other words, minutes can be omitted while hours and seconds are both present.

the other problem is that the empty string also matches, since all three groups are optional, so you have to work around this somehow.
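Both problems can be reproduced with Python's re module. The sketch below uses the all-optional pattern unchanged; OPTIONAL and fully_matches are hypothetical names for illustration.

```python
import re

# The all-optional pattern from above, unchanged.
OPTIONAL = re.compile(r"([0-9]{1,2}h)?([0-9]{1,2}m)?([0-9]{1,2}(\.[0-9]{1,3})?s)?")


def fully_matches(text):
    """True if the whole text is accepted by the all-optional pattern."""
    return OPTIONAL.fullmatch(text) is not None
```

Here fully_matches("") and fully_matches("10h30s") are both true, which is exactly what the question wants to rule out.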


looking at @dbaupp's h(ms?)?|ms?|s, you can take the above and substitute:

h: [0-9]{1,2}h
m: [0-9]{1,2}m
s: [0-9]{1,2}(\.[0-9]{1,3})?s

so you get to:

h(ms?)?: [0-9]{1,2}h([0-9]{1,2}m([0-9]{1,2}(\.[0-9]{1,3})?s)?)?
  ms?  :             [0-9]{1,2}m([0-9]{1,2}(\.[0-9]{1,3})?s)?
   s   :                         [0-9]{1,2}(\.[0-9]{1,3})?s

all those OR'd together give you a big but easy to break down regex:

[0-9]{1,2}h([0-9]{1,2}m([0-9]{1,2}(\.[0-9]{1,3})?s)?)?|[0-9]{1,2}m([0-9]{1,2}(\.[0-9]{1,3})?s)?|[0-9]{1,2}(\.[0-9]{1,3})?s

which gets you around both the empty string problem and the match of hs.


looking at @Donal Fellows' comment on @dbaupp's answer, I'll also do (h?m)?S|h?M|H

(h?m)?s: (([0-9]{1,2}h)?[0-9]{1,2}m)?[0-9]{1,2}(\.[0-9]{1,3})?s
 h?m   :  ([0-9]{1,2}h)?[0-9]{1,2}m
 h     :   [0-9]{1,2}h

and merged together, you end up with something smaller than the above:

(([0-9]{1,2}h)?[0-9]{1,2}m)?[0-9]{1,2}(\.[0-9]{1,3})?s|([0-9]{1,2}h)?[0-9]{1,2}m|[0-9]{1,2}h

now we have to find a way to match the .xx decimal representation
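One way this might be done is to plug the number pattern from the question into the (h?m)?S|h?M|H shape, where the capitalised segment is the only one allowed to carry a decimal. This is a sketch under assumptions: it drops the {1,2} and {1,3} digit limits used above, and the names INT, DEC and TIME_RE are illustrative.

```python
import re

# Integer-only and "integer or decimal" numbers (syntax taken from the question).
INT = r"[0-9]+"
DEC = r"(?:[0-9]*\.[0-9]+|[0-9]+(?:\.[0-9]*)?)"

# Lowercase: integer-only segments. Uppercase: least significant segment.
h, m = INT + "h", INT + "m"
H, M, S = DEC + "h", DEC + "m", DEC + "s"

# (h?m)?S | h?M | H
TIME_RE = re.compile(rf"(?:(?:{h})?{m})?{S}|(?:{h})?{M}|{H}")
```

With fullmatch, this accepts literals such as .5h and 10h15.2m and rejects 10h30s, 10.5h30m and the empty string.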

c00kiemon5ter
  • Matching the empty string is a huge problem, I'm afraid. It breaks the whole lexing stage. For other purposes this would probably be fine though, if you add the necessary magic to get decimals to work for things like `10h15.2m`. – Jonas Schäfer Jul 02 '12 at 12:16
  • I just saw `to allow for .5`. I will think about it – c00kiemon5ter Jul 02 '12 at 12:17
1

Here is a short Python expression that works:

(\d+h)?(\d+m)?(\d*\.\d+|\d+(\.\d*)?)(?(2)s|(?(1)m|[hms]))

Inspired by Cameron Martin's answer based on conditionals.

Explained:

(\d+h)?                 # optional int "h" (capture 1)
(\d+m)?                 # optional int "m" (capture 2)
(\d*\.\d+|\d+(\.\d*)?)  # int or decimal 
(?(2)                   # if "m" (capture 2) was matched:
  s                       # "s"
| (?(1)                 # else if "h" (capture 1) was matched:
  m                       # "m"
|                       # else (nothing matched):
  [hms]))                 # any of the "h", "m" or "s"
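A short sketch of how this expression might be exercised with Python's re module, using fullmatch so that the whole literal has to match. TIME_RE and is_time_literal are hypothetical names for illustration.

```python
import re

# The conditional expression from above, unchanged.
TIME_RE = re.compile(r"(\d+h)?(\d+m)?(\d*\.\d+|\d+(\.\d*)?)(?(2)s|(?(1)m|[hms]))")


def is_time_literal(text):
    """True if the whole text is a valid time literal."""
    return TIME_RE.fullmatch(text) is not None
```

The four literals from the question are accepted, while 10h30s, 10.5h30m and 10h10h10h are rejected.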
Community
Qtax
0

You may have hours, minutes, and seconds.

    /(\d{1,2}h)*(\d{1,2}m)*(\d{1,2}(\.\d+)*s)*/

should do the work. Depending on the regex library, you will get your items in order, or you will have to parse them further to check for h, m or s.

In this latter case, see also what is returned by

   /(\d{1,2}(h))*(\d{1,2}(m))*(\d{1,2}(\.\d+)*(s))*/
LSerni
  • Doesn't this parse `10h10h10h`? – huon Jul 02 '12 at 12:01
  • @dbaupp it does :(. Such a literal shall syntax-error while lexing/parsing the language. – Jonas Schäfer Jul 02 '12 at 12:01
  • Of course you have to bind this to the whole value: ^...$ - otherwise it would also parse "It is now 10h30m, good morning" :-) . Add a ^ at the beginning and a $ at the end to match the full value. – LSerni Jul 02 '12 at 12:05
  • It is an regex in a lexer. The lexer cannot know that it has to abort after the first `h` if that's not in your regex. – Jonas Schäfer Jul 02 '12 at 12:06
0

The last group should be:

([0-9]*\.[0-9]+|[0-9]+(\.[0-9]+)?)

unless you want to match "5." (a number with a trailing dot).
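The difference between the two variants can be checked directly. This is a small sketch with hypothetical names; the first pattern is the one from the question, the second is the one suggested here.

```python
import re

# Question's version: digits after the dot are optional, so "5." is accepted.
TRAILING_DOT_OK = re.compile(r"([0-9]*\.[0-9]+|[0-9]+(\.[0-9]*)?)")

# Suggested version: at least one digit is required after the dot.
TRAILING_DOT_REJECTED = re.compile(r"([0-9]*\.[0-9]+|[0-9]+(\.[0-9]+)?)")
```

Both patterns fully match .5, 0.5, 5.0 and 5; only the first one fully matches "5.".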


You could use regex ifs, like so:

(([0-9]+h)?([0-9]+m)?([0-9]+s)?)(?(?<=h)(([0-9]*\.[0-9]+|[0-9]+(\.[0-9]*)?)m)?|(?(?<=m)(([0-9]*\.[0-9]+|[0-9]+(\.[0-9]*)?)s)?|\b(([0-9]*\.[0-9]+|[0-9]+(\.[0-9]*)?)[hms])?))

Here - http://regexr.com?31dmj

I haven't checked that this works, but it tries to match plain integers for hours, minutes and then seconds first. Then, if the last thing matched is hours, it allows fractional minutes; otherwise, if the last thing matched is minutes, it allows fractional seconds.

Cameron Martin
  • Your expression does not compile with Python's `re` module (and I'm fine with matching `5.`). – Jonas Schäfer Jul 02 '12 at 12:13
  • I said I haven't checked that it works, but the idea is there. Read the link I gave you. – Cameron Martin Jul 02 '12 at 12:14
  • And it wouldn't work with `hms` (when `s` has the decimal form), but good idea. – Qtax Jul 02 '12 at 12:17
  • Yes it would/should (if I've got the if syntax correct). It will match hours and minutes in integer form, fail on the optional integer seconds, then the if statement would check the last thing matched was `m`, and then try to match fractional seconds. – Cameron Martin Jul 02 '12 at 12:20
  • Nope, in your expression there is no decimal form for `s` at all. – Qtax Jul 02 '12 at 12:23
  • Next you should fix the decimal form hours, when only `h` is used that is. And conditionals work in Python 2.4 if used with backrefs, so you could make it work. – Qtax Jul 02 '12 at 12:31