
I'm writing some code that calculates certain statistics about word usages.

Does anyone know where I can find a database of raw news articles from various topics over a period of (say) the last year? Preferably they would be either in plain text format or XML. Trying to scrape content from random web sites isn't a good option.

I know that going forward I could probably archive them myself. However, I need to kick-start the process with a bunch of existing articles... the more the merrier.

Any other ideas for corpus datasets that are easily available in a simple-to-parse form would also be appreciated.

Vadim Kotov
octonion

1 Answer


You might try the Internet Archive. They have a text section, but I don't know if it has news. You might also be able to use their Wayback Machine to pull up news articles from major sites using their RSS feeds.

DMKing
  • Thanks, those are nice ideas. To be honest I was a bit surprised not to have immediately found a raw dump of news articles ready to go just by Googling. I guess it must be copyright related... but then when did that ever stop anyone. – octonion Mar 01 '10 at 23:02
  • Someone else on the programming subreddit also suggested WikiNews. For what I'm doing, that might actually be more appropriate right now. Now I just need to figure out how to extract the articles from MediaWiki XML - hopefully shouldn't be too hard. – octonion Mar 04 '10 at 13:50
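
On the last point, extracting articles from a MediaWiki XML dump, here is a minimal sketch using only the Python standard library. It assumes the usual export layout of `<page>` elements containing a `<title>` and a `<revision>` with a `<text>` child; the namespace URI in the sample is an assumption and varies between export versions, so the code ignores it. The helper names `local_name` and `iter_articles` are made up for illustration. The text it yields is still raw wiki markup, so it would need further cleaning before counting words.

```python
import io
import xml.etree.ElementTree as ET


def local_name(tag):
    """Drop the '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def iter_articles(source):
    """Yield (title, wikitext) for each <page> in a MediaWiki XML export.

    `source` is a filename or a binary file object. If a page holds
    several revisions, the text of the last one in the file is used.
    """
    for _event, elem in ET.iterparse(source, events=("end",)):
        if local_name(elem.tag) != "page":
            continue
        title = None
        text = None
        for child in elem.iter():
            name = local_name(child.tag)
            if name == "title":
                title = child.text
            elif name == "text":
                text = child.text or ""
        if title is not None and text is not None:
            yield title, text
        elem.clear()  # release the page's children to keep memory use low


# Tiny hand-written sample standing in for a real dump (hypothetical data).
SAMPLE = b"""<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/">
  <page>
    <title>Example headline</title>
    <revision>
      <text>Body of the '''example''' article.</text>
    </revision>
  </page>
</mediawiki>"""

articles = list(iter_articles(io.BytesIO(SAMPLE)))
for title, text in articles:
    print(title, "->", len(text.split()), "whitespace-separated tokens")
```

Parsing incrementally with `iterparse` rather than loading the whole tree matters because full dumps can be large; passing a path to the downloaded dump file instead of the `BytesIO` sample should work the same way.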