I have 40k+ articles, each made up of several segments. Each one exists as a Python dictionary with keys title, category, subcat, content, etc.
How can I create a corpus out of these while keeping the different subsections of each article separate, but still have the relation between them accessible for manipulation?
So when I'm done I'd like to, for example, grab all of the titles and do manipulations based only on the other titles, but also be able to link each title back to its main content.
I want to do POS tagging on this and I don't want to mess it up by just combining all the subsections.
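To make that concrete, here is a rough sketch of the kind of access I'm picturing. The `field_view` helper and the sample entries are made up for illustration, and the whitespace split is only a stand-in for a real tokenizer or tagger; I'm not claiming this is the right design.

```python
# Hypothetical sample entries; the real ones come from the DB.
articles = [
    {"title": "Outdoor Dog Kennels & Enclosures", "category": "Pets",
     "subcat": "Dogs", "content": "Putting your dog in a kennel might seem cruel."},
    {"title": "Choosing a Cat Tree", "category": "Pets",
     "subcat": "Cats", "content": "Cats like to climb."},
]


def field_view(articles, field):
    """Return (article_id, text) pairs for one field.

    The article_id is just the list index, which is what links a
    title (or any other field) back to the full article.
    """
    return [(i, a[field]) for i, a in enumerate(articles) if field in a]


# Work on the titles alone ...
titles = field_view(articles, "title")
tokenized = [(i, text.split()) for i, text in titles]  # stand-in for real tokenizing/tagging

# ... then follow the id back to the matching content.
first_id = tokenized[0][0]
linked_content = articles[first_id]["content"]
```

The point is only that each field can be processed on its own while the id keeps the link to the rest of the article.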
Hope that made sense.
Thanks.
edit:
The corpus isn't made yet. I'm going to make it out of this text. Here's what an entry in the DB looks like.
{'category': u'Pets',
'content': u"<p>Putting your dog(s) in outdoor dog kennels might seem like a cruel thing to do, but when you consider that they will be</p>.....",
'signature': u'<p>Find out more on <a target="_new" href="http://petadore.com/outdoor-dog- kennels-a-great-way-to-protect-your-dog-without-building-a-fence/">outdoor dog kennels</a> and r read many interesting articles on <a target="_new" href="http://petadore.com/">pet health care</a>.</p>',
'subcat': u'Dogs',
'title': u'Outdoor Dog Kennels & Enclosures'}
As you can see, it's HTML. I'd like to figure out a way to preserve the tags as well, so I could do tests on the text within the <li> or <b> tags, for instance. That's in a perfect world, though.
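For the tag part, something along these lines is what I have in mind: a minimal sketch using the standard library's `html.parser`, where `TagTextCollector`, `text_by_tag`, and the sample string are hypothetical names and data of my own. It assumes reasonably well-formed HTML and only knows about a handful of void tags.

```python
from html.parser import HTMLParser

# Tags that never get a closing tag; kept off the stack (not an exhaustive list).
VOID_TAGS = {"br", "hr", "img", "input", "meta", "link"}


class TagTextCollector(HTMLParser):
    """Collect each text chunk together with the tags that enclose it."""

    def __init__(self):
        super().__init__()
        self.stack = []   # currently open tags, outermost first
        self.chunks = []  # list of (tuple_of_open_tags, text)

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the most recent matching open tag, tolerating sloppy nesting.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.chunks.append((tuple(self.stack), text))


def text_by_tag(html):
    """Parse an HTML fragment and return its (enclosing_tags, text) chunks."""
    parser = TagTextCollector()
    parser.feed(html)
    parser.close()
    return parser.chunks


# Hypothetical sample, not a real DB entry.
sample = "<p>Kennels are <b>safe</b> for dogs.</p><ul><li>Cheap</li><li>Sturdy</li></ul>"
chunks = text_by_tag(sample)
bold_text = [text for tags, text in chunks if "b" in tags]
list_items = [text for tags, text in chunks if "li" in tags]
```

That way the plain text is still available for tagging, but each chunk remembers whether it came from inside an <li> or a <b>.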