
I have 40k+ articles, each made up of several segments. Each article exists as a Python dictionary with the keys title, category, subcat, content, etc.

How can I create a corpus out of these that keeps the different subsections of each article separate, while still leaving the relation between them accessible for manipulation?

So when I'm done, I'd like to be able to, for example, grab all of the titles and do manipulations based only on the other titles, but also link each title back to its main content.

I want to do POS tagging on this and I don't want to mess it up by just combining all the subsections.
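
One way to keep the fields apart while preserving the link between them might be to index each field by a shared article ID. This is only a minimal sketch: the function name `build_field_index`, the use of list position as the ID, and the second sample record are assumptions made for illustration, and `split()` is just a stand-in for a real tokenizer or POS tagger.

```python
# Hypothetical sample records shaped like the dictionaries described above.
articles = [
    {"title": "Outdoor Dog Kennels & Enclosures", "category": "Pets",
     "subcat": "Dogs", "content": "<p>Putting your dog(s) in outdoor dog kennels</p>"},
    {"title": "Placeholder title", "category": "Pets",
     "subcat": "Cats", "content": "<p>Placeholder content</p>"},
]


def build_field_index(articles, fields):
    """Map each field name to {article_id: text} so fields stay separate."""
    index = {field: {} for field in fields}
    for article_id, article in enumerate(articles):
        for field in fields:
            if field in article:
                index[field][article_id] = article[field]
    return index


index = build_field_index(articles, ["title", "content"])

# Work on titles only, then follow the shared ID back to the content.
title_tokens = {}
for article_id, title in index["title"].items():
    title_tokens[article_id] = title.split()  # stand-in for tokenizing/tagging
    linked_content = index["content"][article_id]
```

Because every field is keyed by the same ID, the titles can be processed as their own collection without ever being concatenated with the content.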

Hope that made sense.

Thanks.

edit:

The corpus isn't built yet. I'm going to build it from this text. Here's what an entry in the DB looks like:

{'category': u'Pets',
 'content': u"<p>Putting your dog(s) in outdoor dog kennels might seem like a cruel thing to do, but when you consider that they will be</p>.....",
 'signature': u'<p>Find out more on <a target="_new" href="http://petadore.com/outdoor-dog-kennels-a-great-way-to-protect-your-dog-without-building-a-fence/">outdoor dog kennels</a> and read many interesting articles on <a target="_new" href="http://petadore.com/">pet health care</a>.</p>',
 'subcat': u'Dogs',
 'title': u'Outdoor Dog Kennels & Enclosures'}

As you can see, it's HTML. I'd like to figure out a way to preserve the tags as well, so I could do tests on the text within the <li> or <b> tags for instance. That's in a perfect world though.
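
For the tag-level tests, one possible approach is to record, for each run of text, which tags it sits inside. The sketch below uses only the standard library's `html.parser`. The class name `TagTextCollector`, the helper `text_by_tag`, and the sample HTML are made up for illustration, and real article markup may need more careful handling of malformed tags.

```python
from collections import defaultdict
from html.parser import HTMLParser


class TagTextCollector(HTMLParser):
    """Collect text chunks under every tag that encloses them."""

    def __init__(self):
        super().__init__()
        self.open_tags = []                    # stack of currently open tags
        self.text_by_tag = defaultdict(list)   # tag name -> list of text chunks

    def handle_starttag(self, tag, attrs):
        self.open_tags.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; this also discards unclosed
        # tags such as <br> that were opened inside it.
        if tag in self.open_tags:
            while self.open_tags.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return
        for tag in set(self.open_tags):
            self.text_by_tag[tag].append(text)


def text_by_tag(html):
    collector = TagTextCollector()
    collector.feed(html)
    collector.close()
    return dict(collector.text_by_tag)


# Hypothetical sample markup, not taken from the real articles.
sample = "<p>Kennels can be <b>safe</b> and:</p><ul><li>sturdy</li><li>roomy</li></ul>"
result = text_by_tag(sample)
```

With something like this, `result["li"]` or `result["b"]` could be tagged or analysed separately from the running paragraph text.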

hippietrail
gEr
  • A small representative sample of the corpora text and the python representation would be immensely helpful. – Chris Eberle Aug 18 '11 at 04:13
  • Don't know what you ended up doing, but I'd have recommended making up a simple XML structure to code all your metadata: `
    Pets Dogs ...
    ...`. The NLTK can read XML-coded corpora. I would also store each article in a separate file, to make it easy to work with subsets.
    – alexis Apr 19 '12 at 14:50
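
The XML suggestion in the comment above could be sketched roughly as follows with the standard library's `xml.etree.ElementTree`. The element names (`article`, `category`, and so on) and the function `article_to_xml` are assumptions, since the original markup in the comment did not survive. The HTML in `content` is stored here as escaped text rather than as nested elements.

```python
import xml.etree.ElementTree as ET

# Field order is an arbitrary choice for this sketch.
FIELDS = ("category", "subcat", "title", "content", "signature")


def article_to_xml(article):
    """Build an <article> element with one child element per field."""
    root = ET.Element("article")
    for field in FIELDS:
        if field in article:
            child = ET.SubElement(root, field)
            child.text = article[field]  # special characters are escaped on output
    return root


# Record shaped like the DB entry in the question (content truncated there too).
article = {
    "category": "Pets",
    "subcat": "Dogs",
    "title": "Outdoor Dog Kennels & Enclosures",
    "content": "<p>Putting your dog(s) in outdoor dog kennels might seem like a cruel thing to do</p>",
}

root = article_to_xml(article)
xml_text = ET.tostring(root, encoding="unicode")

# Round trip: each field comes back as its own element.
parsed = ET.fromstring(xml_text)
title = parsed.find("title").text

# To store one article per file, something like this could be used:
# ET.ElementTree(root).write("article_0.xml", encoding="utf-8")
```

Keeping one file per article, as the comment recommends, should make it straightforward to select subsets by file name or directory.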

0 Answers