I want to extract a date from a sentence.

These are the valid date formats:

dd.mm.yy
dd.mm.yyyy
d.m.yy
d.m.yyyy
dd-mm-yy
dd-mm-yyyy
dd/mm/yy
dd/mm/yyyy

The following regex does the job well.

^(?:(?:31(\/|-|\.)(?:0?[13578]|1[02]))\1|(?:(?:29|30)(\/|-|\.)(?:0?[13-9]|1[0-2])\2))(?:(?:1[6-9]|[2-9]\d)?\d{2})$|^(?:29(\/|-|\.)0?2\3(?:(?:(?:1[6-9]|[2-9]\d)?(?:0[48]|[2468][048]|[13579][26])|(?:(?:16|[2468][048]|[3579][26])00))))$|^(?:0?[1-9]|1\d|2[0-8])(\/|-|\.)(?:(?:0?[1-9])|(?:1[0-2]))\4(?:(?:1[6-9]|[2-9]\d)?\d{2})$

I tested it on multiple online regex testers such as https://www.regexpal.com/

Then I tried it in Python with the following code, which could not extract the date.

import re

def validate_date(text):
    date_regex = '^(?:(?:31(\/|-|\.)(?:0?[13578]|1[02]))\1|(?:(?:29|30)(\/|-|\.)(?:0?[13-9]|1[0-2])\2))(?:(?:1[6-9]|[2-9]\d)?\d{2})$|^(?:29(\/|-|\.)0?2\3(?:(?:(?:1[6-9]|[2-9]\d)?(?:0[48]|[2468][048]|[13579][26])|(?:(?:16|[2468][048]|[3579][26])00))))$|^(?:0?[1-9]|1\d|2[0-8])(\/|-|\.)(?:(?:0?[1-9])|(?:1[0-2]))\4(?:(?:1[6-9]|[2-9]\d)?\d{2})$'
    return re.findall(date_regex, text)


date = validate_date("02.02.2020")
print(date)

What is the reason for this behavior?

Malintha
  • Please add example text to the question. – alani Jul 19 '20 at 10:16
  • Include a Minimal, Reproducible Example. – sushanth Jul 19 '20 at 10:16
  • Why is the regex so complicated? Please explain the cases you want to handle. You'd be better off splitting it into 3 different regexes, as I see 3 `|` (OR) alternatives. – azro Jul 19 '20 at 10:23
  • @azro I added the valid cases. – Malintha Jul 19 '20 at 10:27
  • I'd say it would be easier to try/except parsing into a date with each of the different formats; if they all fail, it fails. Less performant but much more readable. – azro Jul 19 '20 at 10:28
  • My question is why it works online but not in Python, not the complexity or performance. Sorry. – Malintha Jul 19 '20 at 10:31
  • @azro - the regex is so complicated because it handles things like month lengths (so it won't match "31.4.2020") and checking that the separators match (so it won't match "30-4.2020"). I don't think it checks leap years, though. – Jiří Baum Jul 19 '20 at 10:42

3 Answers

1

Please add the r prefix in front of the regex string.

i.e.

date_regex = r'^(?:(?:31(\/|-|\.)(?:0?[13578]|1[02]))\1|(?:(?:29|30)(\/|-|\.)(?:0?[13-9]|1[0-2])\2))(?:(?:1[6-9]|[2-9]\d)?\d{2})$|^(?:29(\/|-|\.)0?2\3(?:(?:(?:1[6-9]|[2-9]\d)?(?:0[48]|[2468][048]|[13579][26])|(?:(?:16|[2468][048]|[3579][26])00))))$|^(?:0?[1-9]|1\d|2[0-8])(\/|-|\.)(?:(?:0?[1-9])|(?:1[0-2]))\4(?:(?:1[6-9]|[2-9]\d)?\d{2})$'

Here the r prefix is required, not just conventional: in a normal string literal Python reads \1 to \4 as octal escapes (the control characters \x01 to \x04) before the regex engine ever sees them, so the backreferences are lost and the pattern cannot match. r'' (and other r-prefixed Python quoting forms) define "raw" strings, which is to say strings in which \ character sequences are (almost) not evaluated.
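
To illustrate the difference, here is a minimal sketch. The toy pattern (a)\1 is made up for the demonstration and is not taken from the question; it only shows what happens to a backreference with and without the r prefix.

```python
import re

plain = '(a)\1'   # '\1' is an octal escape here: the string ends in chr(1)
raw = r'(a)\1'    # the backslash survives, so the regex engine sees backreference \1

print(len(plain), len(raw))              # 4 5
print(plain[-1] == '\x01')               # True
print(re.fullmatch(plain, 'aa'))         # None
print(re.fullmatch(raw, 'aa').group())   # aa
```

The same thing happens to \1 through \4 in the long date pattern when it is written without the r prefix.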

Solution:

import re
def validate_date(text):
    date_regex = r'^(?:(?:31(\/|-|\.)(?:0?[13578]|1[02]))\1|(?:(?:29|30)(\/|-|\.)(?:0?[13-9]|1[0-2])\2))(?:(?:1[6-9]|[2-9]\d)?\d{2})$|^(?:29(\/|-|\.)0?2\3(?:(?:(?:1[6-9]|[2-9]\d)?(?:0[48]|[2468][048]|[13579][26])|(?:(?:16|[2468][048]|[3579][26])00))))$|^(?:0?[1-9]|1\d|2[0-8])(\/|-|\.)(?:(?:0?[1-9])|(?:1[0-2]))\4(?:(?:1[6-9]|[2-9]\d)?\d{2})$'
    return re.finditer(date_regex, text)


date = validate_date("02.02.2020")
for match in date:
    print(match.group())
    # match start: match.start()
    # match end (exclusive): match.end()
    # matched text: match.group()
Ashish Karn
  • This gives me [('', '', '', '.')], not the date. – Malintha Jul 19 '20 at 10:33
  • That's right; `findall` returns the list of tuples of capturing groups, not the full matches. If you test on regex101.com you'll see the groups that are matched are in fact the separators. You can either (a) put a capturing group for the whole date, or (b) use one of the other functions (eg. `finditer`) to get the whole matches rather than the captured groups. – Jiří Baum Jul 19 '20 at 10:40
  • @sabik Thanks, I have added `finditer` with some code changes. – Ashish Karn Jul 19 '20 at 10:47
1

Two Issues

  1. The regex pattern needs to be a raw string, r'...'.
  2. re.search works in this case where re.findall does not; see the answer to "why does findall find nothing, but search works?". So we can use search to find the first occurrence of a date in the string.
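
The difference between the functions can be sketched as follows. The short pattern below is a simplified stand-in assumed for illustration (it does not validate month lengths or leap years); like the original regex, it contains a capturing group for the separator. With a single group, findall returns a list of strings; with the four groups of the original regex it returns tuples such as ('', '', '', '.').

```python
import re

# Simplified stand-in pattern (an assumption, not the regex from the question):
# one capturing group for the separator, reused through the backreference \1.
pattern = r'\d{1,2}(/|-|\.)\d{1,2}\1\d{2,4}'
text = "02.02.2020"

groups_only = re.findall(pattern, text)                      # captured groups, not the match
first_match = re.search(pattern, text).group()               # full text of the first match
all_matches = [m.group() for m in re.finditer(pattern, text)]

print(groups_only)   # ['.']
print(first_match)   # 02.02.2020
print(all_matches)   # ['02.02.2020']
```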

Code

import re

def validate_date(text):
    date_regex = r'^(?:(?:31(\/|-|\.)(?:0?[13578]|1[02]))\1|(?:(?:29|30)(\/|-|\.)(?:0?[13-9]|1[0-2])\2))(?:(?:1[6-9]|[2-9]\d)?\d{2})$|^(?:29(\/|-|\.)0?2\3(?:(?:(?:1[6-9]|[2-9]\d)?(?:0[48]|[2468][048]|[13579][26])|(?:(?:16|[2468][048]|[3579][26])00))))$|^(?:0?[1-9]|1\d|2[0-8])(\/|-|\.)(?:(?:0?[1-9])|(?:1[0-2]))\4(?:(?:1[6-9]|[2-9]\d)?\d{2})$'

    return re.search(date_regex, text)

Test

date = validate_date("02.02.2020")
print(date.group())
# Output: 02.02.2020
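
Note that re.search returns None when nothing matches, so calling .group() directly raises AttributeError on invalid input. A minimal sketch of a guarded version follows; the helper name first_date and the simplified pattern are assumptions made for illustration, not part of the code above.

```python
import re

def first_date(text):
    """Return the matched date text, or None when nothing matches."""
    # Simplified stand-in pattern, assumed for illustration only.
    match = re.search(r'^\d{1,2}(/|-|\.)\d{1,2}\1\d{2,4}$', text)
    return match.group() if match else None

print(first_date("02.02.2020"))  # 02.02.2020
print(first_date("02-02.2020"))  # None (separators differ)
print(first_date("not a date"))  # None
```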
DarrylG
  • I am writing this comment for two reasons. First of all, thank you for the correct answer. Also, thank you very much for answering the exact point of the question, not other unrelated issues. You might see the comments and other answers. I downvoted some answers, which resulted in a downvote against the question and in answer deletions. Thanks again for your time. – Malintha Jul 19 '20 at 10:49
  • @DarryIG Your output is not correct (`# Output: 02.02.2020`). You will get a match object such as `<_sre.SRE_Match object; span=(0, 10), match='02.02.2020'>`. You need to add `.group()`. – Ashish Karn Jul 19 '20 at 10:53
0

When writing regexes, use the r'...' form of strings:

    date_regex = r'^(?:(?:31(\/|-|\.)(?:0?[13578]|1[02]))\1|(?:(?:29|30)(\/|-|\.)(?:0?[13-9]|1[0-2])\2))(?:(?:1[6-9]|[2-9]\d)?\d{2})$|^(?:29(\/|-|\.)0?2\3(?:(?:(?:1[6-9]|[2-9]\d)?(?:0[48]|[2468][048]|[13579][26])|(?:(?:16|[2468][048]|[3579][26])00))))$|^(?:0?[1-9]|1\d|2[0-8])(\/|-|\.)(?:(?:0?[1-9])|(?:1[0-2]))\4(?:(?:1[6-9]|[2-9]\d)?\d{2})$'

That makes sure that backslashes (\) are interpreted as part of the regex (rather than in the way they're treated in other strings).

Jiří Baum
  • This gives me [('', '', '', '.')], not the date. – Malintha Jul 19 '20 at 10:32
  • That's right; `findall` returns the list of tuples of capturing groups, not the full matches. If you test on regex101.com you'll see the groups that are matched are in fact the separators. You can either (a) put a capturing group for the whole date, or (b) use one of the other functions (eg. `finditer`) to get the whole matches rather than the captured groups. – Jiří Baum Jul 19 '20 at 10:44