I have the following list:

l = [<div class="date">8 December 2004</div>,
 <div class="date">6 December 2004</div>,
 <div class="date">18 October 2004</div>,
 <div class="date">9 October 2004</div>,
 <div class="date">8 August 2004</div>,
 <div class="date">18 June 2004</div>,
 <div class="date">23 December 2005</div>,
 <div class="date">19 December 2005</div>,
 <div class="date">19 December 2005</div>,
 <div class="date">15 December 2005</div>]

I would like to convert it into a DataFrame with a Date column of datetime dtype (as produced by pd.to_datetime).

I tried many solutions (see one below), but I couldn't get my head around it.

pd.to_datetime(pd.DataFrame({'Date': l}), format='%d %B %Y')

Can anyone help me?

Thanks!

Rollo99

2 Answers

Extract the text inside the tags with BeautifulSoup, then convert it to datetimes:

import pandas as pd
from bs4 import BeautifulSoup

# str(x) keeps this working whether the items are strings or Tag objects
df = pd.DataFrame({'Date': [BeautifulSoup(str(x), features="lxml").text for x in l]})
df['Date'] = pd.to_datetime(df['Date'], format='%d %B %Y')
print(df)
        Date
0 2004-12-08
1 2004-12-06
2 2004-10-18
3 2004-10-09
4 2004-08-08
5 2004-06-18
6 2005-12-23
7 2005-12-19
8 2005-12-19
9 2005-12-15
jezrael

If the list comes straight from BeautifulSoup (so its items are Tag objects), you should be able to take each element's .text and pass the resulting Series to pd.to_datetime:

pd.to_datetime(pd.Series([e.text for e in l]))

But if the items are actually strings already, you'll need to extract the date from the divs. In that case you might want something like this to remove the div tags:

import re
pd.to_datetime(pd.Series([re.sub(r'<\/?div.*?>', '', s) for s in l]))

Alternatively, you could extract the dates themselves using a regular expression perhaps like \d{1,2} \w+ \d{4}.
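A minimal sketch of that alternative, assuming the list items are plain strings rather than Tag objects. The name html_strings and its sample values are only illustrative:

```python
import re

# Assumption: items are plain strings; these samples mirror the question's data.
html_strings = [
    '<div class="date">8 December 2004</div>',
    '<div class="date">18 October 2004</div>',
]

# Day (1-2 digits), a month name, then a 4-digit year.
date_pattern = r'\d{1,2} \w+ \d{4}'

# re.search returns the first match in each string; group(0) is the matched text.
dates = [re.search(date_pattern, s).group(0) for s in html_strings]
```

The resulting list of date strings can then be wrapped in a pd.Series and passed to pd.to_datetime, as in the snippets above. Note that re.search returns None for a string with no date in it, so real data may need a guard.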

Note that compiling the pattern explicitly is not necessary. For short scripts, like most pandas scripts, regular expressions are compiled and cached automatically, according to the re module documentation:

The compiled versions of the most recent patterns passed to re.compile() and the module-level matching functions are cached, so programs that use only a few regular expressions at a time needn’t worry about compiling regular expressions.
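As a rough illustration, the module-level function and an explicitly compiled pattern give the same result; the sample strings here are only illustrative:

```python
import re

# Illustrative sample strings, mirroring the question's data.
samples = [
    '<div class="date">8 December 2004</div>',
    '<div class="date">6 December 2004</div>',
]

tag_pattern = r'</?div.*?>'

# Module-level call: re compiles the pattern on first use and caches it.
via_module = [re.sub(tag_pattern, '', s) for s in samples]

# Explicit compilation: same result, we just hold the pattern object ourselves.
compiled = re.compile(tag_pattern)
via_compiled = [compiled.sub('', s) for s in samples]
```

Either form is fine here; explicit compilation mainly helps readability when the same pattern is reused in many places.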

ifly6