I have a series of strings that all look like "Saturday, December 27th 2014". I want to drop the "Saturday" and save each file under a name like "141227", which is year + month + day. So far everything works except the regexes for daypos and yearpos. They both give the same error:
Traceback (most recent call last):
  File "scrapewaybackblog.py", line 17, in <module>
    daypos = byline.find(re.compile("[A-Z][a-z]*\s"))
TypeError: expected a character buffer object
What's a character buffer object? Does that mean there's something wrong with my expression? Here's my script:
for i in xrange(3, 1, -1):
    page = urllib2.urlopen("http://web.archive.org/web/20090204221349/http://www.americansforprosperity.org/nationalblog?page={}".format(i))
    soup = BeautifulSoup(page.read())
    snippet = soup.find_all('div', attrs={'class': 'blog-box'})
    for div in snippet:
        byline = div.find('div', attrs={'class': 'date'}).text.encode('utf-8')
        text = div.find('div', attrs={'class': 'right-box'}).text.encode('utf-8')
        monthpos = byline.find(",")
        daypos = byline.find(re.compile("[A-Z][a-z]*\s"))
        yearpos = byline.find(re.compile("[A-Z][a-z]*\D\d*\w*\s"))
        endpos = monthpos + len(byline)
        month = byline[monthpos+1:daypos]
        day = byline[daypos+0:yearpos]
        year = byline[yearpos+2:endpos]
        output_files_pathname = 'Data/' # path where output will go
        new_filename = year + month + day + ".txt"
        outfile = open(output_files_pathname + new_filename,'w')
        outfile.write(date)
        outfile.write("\n")
        outfile.write(text)
        outfile.close()
    print "finished another url from page {}".format(i)
I also haven't figured out how to turn December into 12, but that's for another time. Please just help me find the right positions.
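For reference, here is a minimal sketch of the transformation described above, assuming every byline has the shape "Weekday, Month DDth YYYY". `str.find` only accepts a plain string, which is why handing it a compiled pattern raises the TypeError. Positions have to come from a match object instead. The names `BYLINE_RE`, `MONTHS` and `byline_to_stamp` are hypothetical and not part of the script above.

```python
import re

# Assumed byline shape: "<Weekday>, <Month> <day><suffix> <year>",
# e.g. "Saturday, December 27th 2014". Other shapes will not match.
BYLINE_RE = re.compile(
    r"[A-Z][a-z]+,\s+([A-Z][a-z]+)\s+(\d{1,2})(?:st|nd|rd|th)?,?\s+(\d{4})"
)

# Month name -> number via list position (index 0 is January).
MONTHS = [
    "January", "February", "March", "April", "May", "June",
    "July", "August", "September", "October", "November", "December",
]


def byline_to_stamp(byline):
    """Return 'YYMMDD' for a byline, or None if it does not match."""
    match = BYLINE_RE.search(byline)
    if match is None:
        return None
    month_name, day, year = match.groups()
    if month_name not in MONTHS:
        return None
    month = MONTHS.index(month_name) + 1
    # Two-digit year, zero-padded month and day.
    return "{}{:02d}{:02d}".format(year[-2:], month, int(day))


# If a position is still wanted, it comes from the match object:
# pattern.search(text).start() rather than text.find(pattern).
m = re.search(r"[A-Z][a-z]*\s", "Saturday, December 27th 2014")
month_start = m.start()  # index where "December " begins
```

With this approach the slicing by position is no longer needed, because the capture groups hand back the month, day and year directly. `byline_to_stamp(byline) + ".txt"` could then serve as `new_filename`, provided the result is checked for `None` first.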