I have an HTML document with many lines like:

<option value="29">Soil Temperature (<i>2002-10-17</i>)</option>

or like:

<option value="STO">Soil &amp; Air Temperature (2002-10-17)</option>

For each line, I want to check whether it contains a date in the YYYY-mm-dd format and, if it does, extract that date.

The following code does not work:

import datetime

line = '<option value="29">Soil Temperature (<i>2002-10-17</i>)</option>'
date = datetime.datetime.strptime(line, '%Y-%m-%d')

It gives me the error:

ValueError: time data '<option value="29">Soil Temperature (<i>2002-10-17</i>)</option>' does not match format '%Y-%m-%d'

Is there an easy way to extract the date?

jirikadlec2

2 Answers

You can use the following pattern:

\b\d{4}-\d\d?-\d\d?\b

>>> import datetime
>>> import re
>>>
>>> line = '<option value="29">Soil Temperature (<i>2002-10-17</i>)</option>'
>>> dt_list = re.findall(r'\b\d{4}-\d\d?-\d\d?\b', line)
>>> [datetime.datetime.strptime(dt, '%Y-%m-%d') for dt in dt_list]
[datetime.datetime(2002, 10, 17, 0, 0)]

NOTE: You should escape each \ or use a raw string literal, as shown in the example above. Otherwise the backslash will be interpreted as the start of an escape sequence. In particular, \b will be interpreted as a BACKSPACE character instead of a word boundary.
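
To cover the "check if the line contains the date" part of the question, the same pattern can be wrapped in a small helper. This is only a sketch: the names DATE_RE and extract_date are made up for illustration, and it assumes that a line without a valid date should yield None.

```python
import datetime
import re

# Same pattern as above, compiled once; the raw string keeps \b a word boundary.
DATE_RE = re.compile(r'\b\d{4}-\d\d?-\d\d?\b')


def extract_date(line):
    """Return the first valid YYYY-mm-dd date found in line, or None.

    Hypothetical helper: the regex only checks the shape of the text,
    so strptime is used to reject impossible dates such as 2002-13-45.
    """
    for candidate in DATE_RE.findall(line):
        try:
            return datetime.datetime.strptime(candidate, '%Y-%m-%d').date()
        except ValueError:
            # Looked like a date but is not a real calendar date; keep looking.
            continue
    return None


line = '<option value="29">Soil Temperature (<i>2002-10-17</i>)</option>'
print(extract_date(line))
print(extract_date('<option value="1">No date here</option>'))
```

Returning None rather than raising keeps the per-line check simple: `if extract_date(line) is not None: ...`.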

falsetru

Alternatively, you can use a BeautifulSoup HTML parser in conjunction with dateutil:

from bs4 import BeautifulSoup
from dateutil.parser import parse


data = """
<select>
    <option value="29">Soil Temperature (<i>2002-10-17</i>)</option>
    <option value="STO">Soil &amp; Air Temperature (2002-10-17)</option>
</select>
"""

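# Name the parser explicitly so bs4 does not have to guess one.
soup = BeautifulSoup(data, 'html.parser')
for option in soup('option'):
    # fuzzy=True lets dateutil skip the text surrounding the date.
    print(parse(option.text, fuzzy=True))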

Prints datetime objects:

2002-10-17 00:00:00
2002-10-17 00:00:00

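Note that fuzzy parsing can behave in a surprising way: if no date is found in the string, then depending on the dateutil version it may return the current date instead of raising an error (more recent releases raise a ValueError). See Trouble in parsing date using dateutil.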
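
If that fallback is a concern, one option is to stay strict: pull the text out of each option element and accept only an explicit YYYY-mm-dd match. The sketch below swaps in the standard library's html.parser in place of BeautifulSoup and dateutil so that it runs without extra packages. OptionTextCollector and option_dates are hypothetical names, and an option without a date is assumed to map to None.

```python
import datetime
import re
from html.parser import HTMLParser

DATE_RE = re.compile(r'\b\d{4}-\d\d?-\d\d?\b')


class OptionTextCollector(HTMLParser):
    """Collect the text content of every <option> element (illustrative)."""

    def __init__(self):
        # convert_charrefs defaults to True, so &amp; arrives as &.
        super().__init__()
        self.texts = []
        self._in_option = False

    def handle_starttag(self, tag, attrs):
        if tag == 'option':
            self._in_option = True
            self.texts.append('')

    def handle_endtag(self, tag):
        if tag == 'option':
            self._in_option = False

    def handle_data(self, data):
        # Text inside nested tags such as <i> is appended as well.
        if self._in_option:
            self.texts[-1] += data


def option_dates(html):
    """Return a list of (option_text, date_or_None) pairs."""
    collector = OptionTextCollector()
    collector.feed(html)
    collector.close()
    results = []
    for text in collector.texts:
        match = DATE_RE.search(text)
        date = None
        if match:
            try:
                date = datetime.datetime.strptime(match.group(), '%Y-%m-%d').date()
            except ValueError:
                date = None  # matched the shape but is not a real date
        results.append((text, date))
    return results


data = """
<select>
    <option value="29">Soil Temperature (<i>2002-10-17</i>)</option>
    <option value="STO">Soil &amp; Air Temperature (2002-10-17)</option>
    <option value="X">No date here</option>
</select>
"""

for text, date in option_dates(data):
    print(text, '->', date)
```

Unlike fuzzy parsing, this never guesses: a string with no explicit date simply produces None.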

alecxe