
I'm using this method exactly, but when I try to restrict the results to English with lang="en" (and every other variation of that I could think of), it doesn't work. This is what I'm putting in, with keywords to limit it further, and it still isn't giving me just English; I've tried with and without keywords. I'm trying to build a 200,000+ Tweet searchable control corpus in English only for a research project, and I don't want to go through that many Tweets by hand. Ideas?

>>> from nltk.twitter import Twitter
>>> tw = Twitter()
>>> tw.tweets(keywords='Delicacy, reptile, death, hold, dark, column, gifted, surgeon, brave, fashion, pearl, diamond, bent, sparkle, present, missing, shadow, holiday, glide, scanner, luster, immunity, devour, discipline, barbaric, fortunate, heart, puzzle, ache, crystal', 
        limit=10000, lang="en", to_screen=False)
Writing to /Users/rhiannalavalla/twitter-files/tweets.20170521-235221.json
Written 10000 Tweets

1 Answer


The lang option is passed on to the Twitter search API, so you are indeed requesting "English" tweets. But have you used Twitter? You don't have to declare the language of each and every tweet, so Twitter can't restrict your results accurately. The lang option evidently matches the author's choice of language for their UI, not the language of the individual tweets.

To restrict your results to tweets in English, search by hashtags and/or user IDs that are likely to be of interest to English speakers only (the specifics will depend on what your corpus is for). Alternatively (or in addition), you can run an automated language identification algorithm over the collected tweets and filter out the suspect ones; a sketch of that approach follows. NLTK also comes with the langid corpus of language trigram statistics, which you could use to train your own recognizer.
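For instance, here is a minimal post-filtering sketch. It assumes the collected tweets are stored one JSON object per line (the format of the tweets.20170521-235221.json file your run reports writing), and it uses the third-party langdetect package as the identifier rather than a recognizer trained on NLTK data; it also consults the lang code that Twitter attaches to each tweet object. The file paths are just placeholders for your own files.

import json

from langdetect import detect, DetectorFactory
from langdetect.lang_detect_exception import LangDetectException

DetectorFactory.seed = 0  # make langdetect's guesses reproducible

# Substitute the file that tw.tweets() reported writing.
INFILE = "/Users/rhiannalavalla/twitter-files/tweets.20170521-235221.json"
OUTFILE = "/Users/rhiannalavalla/twitter-files/tweets.english-only.json"

kept = dropped = 0
with open(INFILE, encoding="utf-8") as src, open(OUTFILE, "w", encoding="utf-8") as dst:
    for line in src:
        line = line.strip()
        if not line:
            continue
        tweet = json.loads(line)
        text = tweet.get("text", "")

        # First trust Twitter's own per-tweet language tag, if present...
        twitter_says_english = tweet.get("lang") == "en"

        # ...then double-check with an automatic language identifier.
        try:
            detector_says_english = detect(text) == "en"
        except LangDetectException:
            detector_says_english = False  # e.g. the text is only URLs or emoji

        if twitter_says_english and detector_says_english:
            dst.write(line + "\n")
            kept += 1
        else:
            dropped += 1

print("kept %d tweets, dropped %d" % (kept, dropped))

Requiring Twitter's tag and the detector to agree is deliberately conservative: for a control corpus it is usually cheaper to drop a borderline tweet and keep collecting than to let non-English text slip in.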
