I am using python in scrapy and collecting a bunch of dates that are stored on a web page in the form of text strings like "11th November" (no year is provided).

I was trying to use

from datetime import datetime

startdate = '11th November'
datetime.strptime(startdate, '%d %B')

but I don't think it likes the 'th' and I get a

ValueError: time data '11th November' does not match format '%d %B'

If I write a function to strip the "th", "st", "rd" and "nd" from the day, I figure it will also strip the same letters out of the month name (the "st" in "August", for example).

Is there a better way to approach turning this into a date format?

For my use, it ultimately needs to be in the ISO 8601 format YYYY-MM-DD

This is so that I can pipe it from scrapy to a database, and from there use it in a Google Spreadsheet for a JavaScript Google chart. I mention this only because there may be a better place to make the string-to-date change than in Python.

(As a secondary issue, I also need to figure out how to add the right year to the date, given that if it says 12th January that would mean Jan 2020 and not 2019. This will be based on a comparison with the date when the scrape runs, i.e. today's date.)

EDIT: it turned out that the solution required the secondary issue to be addressed as well, hence the choice of final answer to this question. When the year was not supplied, it defaulted to 1900, which was a problem.

mdkb
  • I don't see an option for handling the 'th' at https://www.journaldev.com/23365/python-string-to-datetime-strptime so you may have to deal with changing that format – oppressionslayer Nov 12 '19 at 04:34
  • Does this answer your question? [How to get the datetime from a string containing '2nd' for the date in Python?](https://stackoverflow.com/questions/28091947/how-to-get-the-datetime-from-a-string-containing-2nd-for-the-date-in-python) – razdi Nov 12 '19 at 04:39
  • @razdi yes at least one of them did. not sure how to add your comment as the answer since I cannot upvote comments yet, and you answered it first – mdkb Nov 12 '19 at 04:58

1 Answer

Try this out -

import re
import datetime
day_month = re.sub(r"\b([0123]?[0-9])(st|th|nd|rd)\b", r"\1", startdate)  # strip the ordinal suffix from the day only
datetime_obj = datetime.datetime.strptime(day_month + " " + str(datetime.datetime.now().year), "%d %B %Y")
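The snippet above always appends the current year, so a "12th January" scraped in November would land in the past. Below is a minimal sketch of one way to pick the year and produce the ISO 8601 string the question asks for. The function name `parse_day_month` and its `today` parameter are hypothetical, and the rule "a date earlier than today belongs to next year" is an assumption about the scraped data (all dates are upcoming), not something the page guarantees. It also assumes English month names, since `%B` is locale-dependent, and it does not handle 29th February specially.

```python
import re
from datetime import date, datetime


def parse_day_month(text, today=None):
    """Turn a string like '11th November' into a date (hypothetical helper).

    Assumption: scraped dates are upcoming, so a day/month that has already
    passed this year is taken to mean next year.
    """
    today = today or date.today()
    # The suffix is removed only when it directly follows 1-2 digits,
    # so month names such as 'August' are left untouched.
    cleaned = re.sub(r"\b(\d{1,2})(st|nd|rd|th)\b", r"\1", text)
    # Supplying a year explicitly avoids strptime's default of 1900.
    parsed = datetime.strptime(f"{cleaned} {today.year}", "%d %B %Y").date()
    if parsed < today:
        parsed = parsed.replace(year=today.year + 1)
    return parsed


# Example with a fixed "scrape date" so the result is reproducible.
scrape_day = date(2019, 11, 12)
print(parse_day_month("12th January", today=scrape_day).isoformat())
print(parse_day_month("22nd December", today=scrape_day).isoformat())
```

Calling `.isoformat()` on the returned `date` gives the `YYYY-MM-DD` form directly, so no separate `strftime` call is needed before writing to the database.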
Sushant
  • This solution worked to address both my questions: removal of the day text, and also putting the year into the end result. Without both being solved, datetime.strptime was setting the year to 1900 by default, which was no good on its own. The only change I had to make to this solution was that I did not need to use datetime.datetime.strptime, only datetime.strptime and datetime.now(); then it worked. – mdkb Nov 12 '19 at 05:06
  • Yes, usage of `datetime` depends on how you have imported it. You must have imported `from datetime import datetime` – Sushant Nov 12 '19 at 05:08
  • yes, that is how I have my import in the code. that explains it. – mdkb Nov 12 '19 at 05:12
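To illustrate the import point from the comments above, here is a small sketch showing that both spellings reach the same class. The alias `dt_module` is only an illustrative name, used so that both import styles can sit side by side in one file.

```python
import datetime as dt_module      # module style: the class is dt_module.datetime
from datetime import datetime     # class style: the name is the class itself

# Both names refer to the same class, so only the prefix differs.
same_class = dt_module.datetime is datetime

a = dt_module.datetime.strptime("11 November 2019", "%d %B %Y")
b = datetime.strptime("11 November 2019", "%d %B %Y")
print(same_class, a == b)
```

With a plain `import datetime`, the calls are `datetime.datetime.strptime(...)` and `datetime.datetime.now()`; with `from datetime import datetime`, they shorten to `datetime.strptime(...)` and `datetime.now()`.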