I've been searching the web and found that media outlets such as CNN and NPR provide links to their transcripts. Obtaining them, however, requires writing something like a crawler, which is not very convenient. I'm trying to use transcripts of TV shows, interviews, radio programs, and movies as training data in my natural language processing projects. Is there a collection or database freely available on the web that I can download all at once, without writing a crawler myself?
-
Hi Kelvin! Please let us know what kind of research you've already done. Also, note from here (http://stackoverflow.com/help/dont-ask) that some subjective questions are allowed, but that they should "invite sharing experiences over opinions" and "insist that opinion be backed up with facts and references". Also see #1 of the guidelines here (http://blog.stackoverflow.com/2010/09/good-subjective-bad-subjective/) on asking for recommendations. I politely disagree with @ThomasJungblut in that this is not a place to ask for recommendations. It should just be in an informed and informative manner. – arturomp Aug 27 '13 at 18:02
-
@ThomasJungblut So what do you think of these questions: http://stackoverflow.com/questions/3340810/twitter-social-networking-dataset http://stackoverflow.com/questions/4251768/twitter-public-dataset Instead of trying to put unhelpful and negative comments here, please focus on helping people get useful things done. – Kelvin Lee Aug 27 '13 at 22:16
-
@ThomasJungblut There are lots of NLP-related topics on SO, and having a good corpus is an important part of the development process. The question is not about recommending a "best" corpus, it's about finding one that fits your task - nothing subjective. – Yasen Aug 29 '13 at 06:03
1 Answer
I would recommend the British National Corpus. I would also mention the American National Corpus, but the transcripts there are only of phone calls or face-to-face conversations - no news, TV shows, etc.
You also mentioned CNN and NPR. There are transcripts from 1996 as an LDC corpus here.

Yasen