3

The general problem

I am trying to understand how to prevent some pattern from appearing directly before or after a sought-out pattern when writing a regex.

A more specific example

I'm looking for a regex that will match dates in the format YYMMDD, i.e. `(([0-9]{2})(0[1-9]|1[0-2])(0[1-9]|[1-2][0-9]|3[0-1]))`, inside a long string while ignoring longer numeric sequences.

It should match:

  • text151124moretext
  • 123text151124moretext
  • text151124
  • text151124moretext1944
  • 151124

But it should ignore:

  • text15112412moretext (reason: it has 8 digits instead of 6)
  • 151324 (reason: it is not a valid date YYMMDD - there is no 13th month)

How can I make sure, within a single regex (I would rather avoid preprocessing the string), that a number with more than these 6 digits won't be picked up as a date? I've thought of `\D((19|20)([0-9]{2})(0[1-9]|1[0-2])(0[1-9]|[1-2][0-9]|3[0-1]))\D`, but doesn't this mean that there has to be some character before and after?

I'm using bash 3.2 (ERE)

thanks!

FotisK
  • And if you use `(^|[^0-9])....([^0-9]|$)`? – Wiktor Stribiżew Jun 28 '17 at 21:34
  • What no assertions ? –  Jun 28 '17 at 21:48
  • There is no `bash` 3.4; did you mean 3.2 that comes with OS X, or 4.3 that you installed yourself? – chepner Jun 28 '17 at 21:49
  • @chepner, ah my mistake - it is 3.2 that came with OSX! thanks for pointing it out! – FotisK Jun 28 '17 at 21:53
  • @sin, that is a good catch! All these flavors of regex give me a headache, and as a beginner I find it really daunting. It seems indeed that what I need is **negative lookahead**. It seems, though, that ERE doesn't support negative lookahead, does it? – FotisK Jun 28 '17 at 22:59
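
The `(^|[^0-9])...([^0-9]|$)` idea from the comment above can be sketched directly with bash's `[[ =~ ]]` operator, since it only needs alternation and bracket expressions, both of which ERE supports. This is a minimal sketch, not a definitive solution: the function name `extract_yymmdd` is made up for illustration, and the pattern checks only month 01-12 and day 01-31, not real calendar validity.

```bash
# Hypothetical helper: prints the leftmost YYMMDD candidate that is not
# part of a longer run of digits, or returns 1 if there is none.
extract_yymmdd() {
    local s="$1"
    # Keep the pattern in a variable and leave it unquoted on the right of =~
    # so that bash 3.2 and later versions all treat it as a regex.
    local re='(^|[^0-9])([0-9]{2}(0[1-9]|1[0-2])(0[1-9]|[12][0-9]|3[01]))([^0-9]|$)'
    if [[ $s =~ $re ]]; then
        # Group 1 is the left guard (start of string or a non-digit),
        # group 2 is the six-digit date itself.
        echo "${BASH_REMATCH[2]}"
    else
        return 1
    fi
}

for s in text151124moretext 123text151124moretext text151124 \
         text151124moretext1944 151124 text15112412moretext 151324; do
    if d=$(extract_yymmdd "$s"); then
        echo "$s -> $d"
    else
        echo "$s -> no match"
    fi
done
```

The guards `(^|[^0-9])` and `([^0-9]|$)` stand in for the lookbehind and lookahead that ERE lacks: they consume one neighbouring character when there is one, which is why the date has to be read from a capture group rather than from the whole match.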

3 Answers

1

Try:

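#!/usr/bin/env bash

# Print the 6-digit date found in "$1", or return 1 if there is none.
extract_date() {
    local string="$1"
    # Accept start/end of line as well as a non-digit on each side of the six digits;
    # -n together with the p flag prints nothing when the pattern does not match.
    local _date
    _date=$(echo "$string" | sed -nE 's/(^|.*[^0-9])([0-9]{6})([^0-9].*|$)/\2/p')
    [ -n "$_date" ] || return 1
    #date -d "$_date" &> /dev/null # for Linux
    date -jf '%y%m%d' "$_date" &> /dev/null # for MacOS
    if [ $? -eq 0 ]; then
        echo "$_date"
    else
        return 1
    fi
}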


extract_date text15111224moretext # ignore n_digits > 6
extract_date text151125moretext # take
extract_date text151132 # ignore day 32
extract_date text151324moretext1944 # ignore month 13
extract_date text150931moretext1944 # ignore 31 Sept
extract_date 151126 # take

Output:

151125
151126
glegoux
  • Thanks @glegoux, that's an interesting approach! Nonetheless, I was looking for a single regex that does that; the reason is that I need to generalize. Now that I think of it, a better way of expressing my problem is: I am trying to understand how to prevent some pattern from appearing directly before or after a sought-out pattern when writing a regex. – FotisK Jun 28 '17 at 22:04
  • Then you can upvote ;). The problem for you is to manage cases like 31 Sept False and 28 Feb False some years. – glegoux Jun 28 '17 at 22:33
  • It seems that **sed -r** does not work on OSX. There is **sed -E**, which stands for Enhanced (not the same as Extended, but for this particular expression it suffices) (https://stackoverflow.com/a/12180129/3017323). **date** on OSX is also different; **date -d** does not work: I had to replace the whole line with `date -jf '%y%m%d' $_date &> /dev/null`. If you agree with the changes, you can integrate them into your answer and I will accept it! – FotisK Jun 29 '17 at 00:48
  • As a sidenote, `150931` passes the `date` evaluation (at least on OSX) and converts into 2015/10/01! In my case it's irrelevant though. – FotisK Jun 29 '17 at 19:17
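
Because `date` may roll an impossible day such as `150931` over into the next month instead of rejecting it, a days-per-month table in plain bash is one way to get a strict check. The following is only a sketch: the function name `is_valid_yymmdd` is hypothetical, and mapping `YY` to the year `20YY` is an assumption that the thread does not state.

```bash
# Hypothetical helper: succeeds only if "$1" is a real calendar date in YYMMDD form.
is_valid_yymmdd() {
    local s="$1"
    local re='^[0-9]{6}$'
    [[ $s =~ $re ]] || return 1

    # The 10# prefix forces base 10, so "08" and "09" are not rejected as octal.
    local yy=$((10#${s:0:2})) mm=$((10#${s:2:2})) dd=$((10#${s:4:2}))
    local year=$((2000 + yy))   # ASSUMPTION: two-digit years mean 20YY

    # Days per month, indexed 1..12 (index 0 is a placeholder).
    local -a days
    days=(0 31 28 31 30 31 30 31 31 30 31 30 31)
    if (( (year % 4 == 0 && year % 100 != 0) || year % 400 == 0 )); then
        days[2]=29   # leap year
    fi

    # The exit status of (( )) is 0 when the expression is true.
    (( mm >= 1 && mm <= 12 && dd >= 1 && dd <= days[mm] ))
}

for d in 151124 150931 160229 150229 151324; do
    if is_valid_yymmdd "$d"; then echo "$d: valid"; else echo "$d: invalid"; fi
done
```

This avoids the differences between the GNU and BSD `date` options altogether, at the cost of hard-coding the century.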
0

If your tokens are line-separated (i.e. there is only one token per line):

^[\D]*[\d]{6}([\D]*|[\D]+[\d]{1,6})$

Basically, this regex looks for:

  • Any number of non-digits at the beginning of the string;
  • Exactly 6 digits
  • Any number of non-digits until the end OR at least one non-digit and at least one digit (up to 6) to the end of the string

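This regex handles most of your sample inputs, but not `123text151124moretext` (it allows no digits before the date), and it does not check that the month and day are valid, so `151324` also matches.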

Ted X
  • Thank you Ted X! However, I can't seem to use \D and \d with the [[ =~ ]] operator of bash; the =~ operator is limited to ERE. Did I get it wrong? – FotisK Jun 28 '17 at 22:51
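
POSIX ERE does not define the `\d` and `\D` shorthands, but the bracket expressions `[0-9]` and `[^0-9]` express the same thing, so the pattern from this answer can be rewritten for `[[ =~ ]]`. The sketch below is only a literal translation of that pattern; the function name `line_has_six_digits` is invented for illustration, and like the original it looks at digit counts only, not at month or day ranges.

```bash
# Hypothetical helper: ERE translation of ^[\D]*[\d]{6}([\D]*|[\D]+[\d]{1,6})$
line_has_six_digits() {
    # \d becomes [0-9] and \D becomes [^0-9]; everything else is unchanged.
    local re='^[^0-9]*[0-9]{6}([^0-9]*|[^0-9]+[0-9]{1,6})$'
    [[ $1 =~ $re ]]
}

for s in text151124moretext text151124moretext1944 text15112412moretext; do
    if line_has_six_digits "$s"; then echo "$s: match"; else echo "$s: no match"; fi
done
```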
0

You could use non-capturing groups to require a non-digit on either side of your date regex. I had success with this expression and your test data.

(?:\D)([0-9]{2})(0[1-9]|1[0-2])(0[1-9]|[1-2][0-9]|3[0-1])(?:\D)
Armstrong
  • Not sure if it's my mistake, but it seems that ERE doesn't support non-capturing groups or \D, \d, etc.? – FotisK Jun 28 '17 at 22:59