I have a list of strings:

my_strings = [
    "2002-03-04 with Matt",
    "Important: 2016-01-23 with Mary",
    "with Tom on 2015-06-30",
]

I want to extract:

  • date (always in yyyy-mm-dd format)
  • person (always in the form "with <person>"), without keeping the word "with"

I could do:

import re
pattern = r'.*(\d{4}-\d{2}-\d{2}).*with \b([^\b]+)\b.*'
matched = [re.match(pattern, x).groups() for x in my_strings]

but it fails because the pattern doesn't match "with Tom on 2015-06-30", so re.match returns None for that string.

Questions

How do I specify the regex pattern to be indifferent to the order in which date or person appear in the string?

and

How do I ensure that the groups() method returns them in the same order every time?

I expect the output to look like this:

[('2002-03-04', 'Matt'), ('2016-01-23', 'Mary'), ('2015-06-30', 'Tom')]
piRSquared

4 Answers

What about doing it with two separate regexes?

my_strings = [
    "2002-03-04 with Matt",
    "Important: 2016-01-23 with Mary",
    "with Tom on 2015-06-30",
]
import re

pattern = r'.*(\d{4}-\d{2}-\d{2})'
dates = [re.match(pattern, x).groups()[0] for x in my_strings]

pattern = r'.*with (\w+).*'
persons = [re.match(pattern, x).groups()[0] for x in my_strings]

output = list(zip(dates, persons))  # zip() is lazy in Python 3, so materialize it
print(output)
## [('2002-03-04', 'Matt'), ('2016-01-23', 'Mary'), ('2015-06-30', 'Tom')]
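
As a possible variation on the same two-pattern idea, here is a minimal sketch that uses re.search instead of re.match (so no leading .* is needed) and returns None for a part that is missing rather than raising an error. The names DATE, PERSON and extract are made up for this illustration and are not part of the answer above.

```python
import re

# Hypothetical names for illustration only.
DATE = re.compile(r"\b(\d{4}-\d{2}-\d{2})\b")
PERSON = re.compile(r"\bwith (\w+)")


def extract(text):
    """Return (date, person); either element is None if it is not found."""
    date = DATE.search(text)
    person = PERSON.search(text)
    return (
        date.group(1) if date else None,
        person.group(1) if person else None,
    )


my_strings = [
    "2002-03-04 with Matt",
    "Important: 2016-01-23 with Mary",
    "with Tom on 2015-06-30",
]

output = [extract(s) for s in my_strings]
print(output)
```

Because each pattern is searched independently, the order of the two parts in the string does not matter, and the tuple is always (date, person).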
Julien Spronck

This should work:

my_strings = [
    "2002-03-04 with Matt",
    "Important: 2016-01-23 with Mary",
    "with Tom on 2015-06-30",
]

import re

alternates = r"(?:\b(\d{4}-\d\d-\d\d)\b|with (\w+)|.)*"

for tc in my_strings:
    print(tc)
    m = re.match(alternates, tc)
    if m:
        print("\t", m.group(1))
        print("\t", m.group(2))

Output is:

$ python test.py
2002-03-04 with Matt
     2002-03-04
     Matt
Important: 2016-01-23 with Mary
     2016-01-23
     Mary
with Tom on 2015-06-30
     2015-06-30
     Tom

However, something like this is not totally intuitive. I encourage you to try using named groups if at all possible.
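
A rough sketch of what the named-group version might look like, assuming the same alternation pattern as above. The group names date and person and the variable names named and results are my own choices for this illustration.

```python
import re

my_strings = [
    "2002-03-04 with Matt",
    "Important: 2016-01-23 with Mary",
    "with Tom on 2015-06-30",
]

# Same alternation as above, but the two captures are named, so they can be
# read back by name regardless of where they appeared in the string.
named = re.compile(
    r"(?:\b(?P<date>\d{4}-\d\d-\d\d)\b|with (?P<person>\w+)|.)*"
)

results = []
for s in my_strings:
    m = named.match(s)  # always matches, since the repeated group may match nothing
    # A group that never participated in the match comes back as None.
    results.append((m.group("date"), m.group("person")))

print(results)
```

Reading the groups by name fixes the order of the output tuple, but note that a string missing one part still matches and simply yields None for that part.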

aghast
  • Named groups are great. Thank you, I learned something very useful. – piRSquared May 09 '16 at 19:45
  • The only problem with this out-of-order method is that it will also match when only one of the two parts is present. Conditionals in the regex module can do out-of-order matching while requiring both parts. This approach does little good unless both parts are guaranteed to be present, or a missing part is not that important. –  May 09 '16 at 19:56

Purely for educational purposes, a non-regex approach could use the dateutil parser in "fuzzy" mode to extract the dates and NLTK's named entity recognition to extract the names. Complete code:

import nltk
from nltk import pos_tag, ne_chunk
from nltk.tokenize import SpaceTokenizer
from dateutil.parser import parse


def extract_names(text):
    tokenizer = SpaceTokenizer()
    toks = tokenizer.tokenize(text)
    pos = pos_tag(toks)
    chunked_nes = ne_chunk(pos)

    # Named entities come back as subtrees; join each one's tokens into a name.
    return [
        ' '.join(token for token, _tag in ne.leaves())
        for ne in chunked_nes
        if isinstance(ne, nltk.tree.Tree)
    ]

my_strings = [
    "2002-03-04 with Matt",
    "Important: 2016-01-23 with Mary",
    "with Tom on 2015-06-30"
]

for s in my_strings:
    print(parse(s, fuzzy=True))
    print(extract_names(s))

Prints:

2002-03-04 00:00:00
['Matt']
2016-01-23 00:00:00
['Mary']
2015-06-30 00:00:00
['Tom']

That's probably an over-complication though.

alecxe

If you use the third-party regex module (an alternative to the standard re module), you can use conditionals to
guarantee a match on both items.

I think this is closer to a standard way of doing out-of-order matching.

(?:.*?(?:(?(1)(?!))\b(\d{4}-\d\d-\d\d)\b|(?(2)(?!))with[ ](\w+))){2}

Expanded

 (?:
      .*? 
      (?:
           (?(1)(?!))
           \b 
           ( \d{4} - \d\d - \d\d )       # (1)
           \b 
        |  (?(2)(?!))
           with [ ] 
           ( \w+ )                       # (2)
      )
 ){2}
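
For comparison, a similar "both parts required, in any order" effect can be sketched with only the standard re module by using two lookaheads. This is a different technique from the conditional pattern above, not a translation of it, and the names both_parts and matched are assumptions made for this example.

```python
import re

my_strings = [
    "2002-03-04 with Matt",
    "Important: 2016-01-23 with Mary",
    "with Tom on 2015-06-30",
]

# Each lookahead scans from the start of the string without consuming input,
# so the two parts may appear in either order, and both must be present.
both_parts = re.compile(
    r"(?=.*?\b(\d{4}-\d\d-\d\d)\b)"  # group 1: the date
    r"(?=.*?\bwith[ ](\w+))"         # group 2: the person
)

matched = []
for s in my_strings:
    m = both_parts.match(s)
    if m:  # None when either part is missing
        matched.append(m.groups())

print(matched)
# [('2002-03-04', 'Matt'), ('2016-01-23', 'Mary'), ('2015-06-30', 'Tom')]
```

Since the groups are numbered by their position in the pattern, groups() always returns the date first and the person second.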
piRSquared