As part of my Master's Degree, I need to collect data from Twitter for future machine learning models.
What's the problem?
I am trying to get tweets with a given hashtag, something really simple such as #climatechange. As I understood from other questions on Stack Overflow, I need to add the q parameter and pass the "#climatechange" string to it.
Here is the code:
import json
import tweepy
# Loads JSON credentials (load_twitter_credentials is a helper defined elsewhere).
twitter_credentials_json = load_twitter_credentials('TwitterCredentials.json')
# Creates tweepy.API object.
auth = tweepy.OAuthHandler(twitter_credentials_json['consumer_key'], twitter_credentials_json['consumer_secret'])
auth.set_access_token(twitter_credentials_json['access_token'], twitter_credentials_json['access_token_secret'])
api = tweepy.API(auth, wait_on_rate_limit=True)
data_list = []
# Iterates through the required tweets and adds them to the list.
for tweet in tweepy.Cursor(api.search, q="#climatechange", since="2020-01-01", until="2020-10-01").items(100):
    data_list.append(tweet._json)
# Drops everything to the file system (get_datetime_as_string is a helper defined elsewhere).
with open(f"Tweets {get_datetime_as_string()}.json", 'w', encoding='utf8') as outfile:
    # The with block closes the file on exit, so no explicit close() is needed.
    outfile.write(json.dumps(data_list))
As you can see, I am searching Twitter for every tweet whose text contains the string "#climatechange", from 2020-01-01 until 2020-10-01, and I take 100 items. When I open the JSON file, I see some unrelated tweets that don't contain the "#climatechange" text. I also checked the whole object that I received from tweepy for those tweets, and the "#climatechange" string isn't mentioned anywhere in it.
For example:
"text": "RT @BetteMidler: The #GOP cannot govern. Remember they presided over #9-11, the #IraqWar, the 2008 #GreatRecession, & when they returned t\u2026"
"text": "RT @DeWayne_Fulton: #Texas can lead the way in energy innovation--safe, clean, efficient, renewable energy.\n\n@Lizzie4Congress knows that th\u2026",
To summarize so far:
- I get tweets from Twitter that match specific conditions.
- I save them to the file system.
- I open the JSON file and about 10% of the tweets don't have the "#climatechange" string in them.
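The share of non-matching tweets mentioned in the summary can be measured with a small check like the one below. This is only a sketch: `count_missing` is a hypothetical helper (not part of tweepy), and it assumes each saved item is a dict with a top-level "text" key, as in the examples above. The `sample` list is made-up data for illustration.

```python
def count_missing(tweets, hashtag="#climatechange"):
    """Return (missing, total): how many tweets lack the hashtag in "text"."""
    needle = hashtag.lower()
    # Case-insensitive substring test on the top-level "text" field only.
    missing = sum(1 for t in tweets if needle not in t.get("text", "").lower())
    return missing, len(tweets)


# Made-up sample in the same shape as the saved JSON items.
sample = [
    {"text": "Act now on #ClimateChange"},
    {"text": "RT @DeWayne_Fulton: #Texas can lead the way in energy innovation\u2026"},
]

missing, total = count_missing(sample)
print(f"{missing} of {total} tweets lack the hashtag in 'text'")
```

On the real file, the list would come from `json.load` on the saved output instead of `sample`.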
What have I tried to solve this issue?
- The first thing I did was go to the official tweepy documentation for the Cursor object (http://docs.tweepy.org/en/v3.9.0/cursor_tutorial.html), but I didn't find anything useful there. I didn't even find the q parameter or any of the other parameters, although many Stack Overflow solutions use them. The documentation seems incomplete or is missing a lot. Where did I go wrong with the documentation?
- I searched Stack Overflow and some other sites to see whether someone else had this issue, but I didn't find anything relevant.
- I searched Stack Overflow for tweepy.Cursor solutions to adjust my parameters. I tried adding some parameters and removing others, but nothing changed.
- I tried reading the tweepy.Cursor code on GitHub, but I didn't fully understand how it works, so no success there.
As I understand it, once I specify the "q" parameter with some string, the search should match tweets that contain that string and return only those. As I see it, though, something goes wrong and it returns unrelated tweets.
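Both example texts above end with an ellipsis (\u2026), which suggests the "text" field was cut short, so the hashtag may sit in the part that is not shown. That is one possible explanation, not a confirmed one. The sketch below checks the untruncated fields as well. It assumes the Twitter v1.1 tweet layout: "full_text" when the request uses tweet_mode="extended", "retweeted_status" for retweets, and "entities" with a "hashtags" list. `has_hashtag` is a hypothetical helper, and the sample dicts are made up.

```python
def has_hashtag(tweet, hashtag="climatechange"):
    """Check a tweet dict, and the tweet it retweets, for a hashtag."""
    tag = hashtag.lstrip("#").lower()
    # A retweet carries the original tweet under "retweeted_status".
    candidates = [tweet, tweet.get("retweeted_status") or {}]
    for t in candidates:
        # Prefer the untruncated "full_text" when it is present.
        text = t.get("full_text") or t.get("text") or ""
        if "#" + tag in text.lower():
            return True
        # Fall back to the parsed hashtag entities.
        tags = (t.get("entities") or {}).get("hashtags") or []
        if any(h.get("text", "").lower() == tag for h in tags):
            return True
    return False


# Made-up retweet whose visible text is truncated before the hashtag.
retweet = {
    "text": "RT @someone: Energy innovation matters\u2026",
    "retweeted_status": {
        "full_text": "Energy innovation matters for everyone. #ClimateChange",
        "entities": {"hashtags": [{"text": "ClimateChange"}]},
    },
}

print(has_hashtag(retweet))
print(has_hashtag({"text": "Nothing relevant here"}))
```

To get "full_text" in the saved data, the search call would presumably need tweet_mode="extended" passed alongside q in tweepy.Cursor. Whether that accounts for every unrelated tweet is an open question.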
I would be happy to get some help, or to hear what I am missing. I am sure it's something small, and that it's the reason I don't get the right data.
Thanks.