I am trying to retrieve a total of 5,000 tweets using the Basic access tier of the Twitter API v2. After the duplicate check in my code runs, I end up with fewer than the 100 tweets I ask for per request. I want to use the next_token parameter, but I don't know how to work it into this code so that the API doesn't page through the same set of tweets on every run, wasting my monthly quota.
Secondly, I want to extract tweets using both keywords (which I've done) and the user's self-defined location, e.g. America and United Kingdom. How do I add a profile-location filter to this search? (To clarify, I don't want the tweet's geo-location but the location in the user's bio.) I've put a rough sketch of the direction I was considering right after my main code below.
Lastly, I have been getting truncated tweets in my data, where a tweet is cut short and followed by a '…' symbol.
I would appreciate any help with this. Thank you.
import tweepy
import pandas as pd
import config1
client = tweepy.Client(bearer_token=config1.bearer_token, wait_on_rate_limit=True)
# Define the query parameters
keywords = 'COVID OR women'  # the OR operator must be uppercase in v2 search queries
start_date = '2023-07-22T02:00:00Z'
end_date = '2023-07-22T18:00:00Z'
# Create a list to store the extracted data
tweets_data = []
# Perform the search query for keyword
query = f'{keywords} lang:en -is:retweet -is:quote -has:media'
response = client.search_recent_tweets(
    query=query,
    max_results=100,
    start_time=start_date,
    end_time=end_date,
    tweet_fields=['id', 'text', 'created_at', 'public_metrics', 'geo'],  # 'geo' gives each tweet its place object for 'Place' below
    expansions=['geo.place_id']
)
# Extract the desired information from each tweet
for tweet in response.data or []:  # response.data is None when nothing matches
    tweet_data = {
        'Tweet ID': tweet['id'],
        'Text': tweet['text'].encode('utf-8', 'ignore').decode('utf-8'),
        'Public metrics': {
            'retweet_count': tweet['public_metrics']['retweet_count'],
            'reply_count': tweet['public_metrics']['reply_count'],
            'like_count': tweet['public_metrics']['like_count']
        },
        'Created At': tweet['created_at'],
        'Place': tweet['geo']
    }
    # Add the tweet data to the list
    tweets_data.append(tweet_data)
# Create a DataFrame from the extracted tweet data
df = pd.DataFrame(tweets_data)
# Load the existing CSV file (start fresh on the first run, when it doesn't exist yet)
try:
    existing_df = pd.read_csv('tweets_ex.csv', encoding='utf-8')
    # Concatenate the existing DataFrame and the new DataFrame
    updated_df = pd.concat([existing_df, df], ignore_index=True)
except FileNotFoundError:
    updated_df = df
# Drop duplicate tweets based on the tweet text (catches identical reposts)
updated_df.drop_duplicates(subset='Text', inplace=True)
# Save the updated DataFrame to the CSV file
updated_df.to_csv('tweets_ex.csv', encoding='utf-8', index=False)
# Print the updated number of rows
print(f"The updated number of rows in the CSV file is: {len(updated_df)}")
When I tried the next_token loop below, I got a "Rate limit exceeded" message with an 800+ second sleep that I eventually had to interrupt. The run returned no tweet extracts but still counted against my quota: although I only asked for 35 tweets per request, my dev portal showed that 2,100 tweets had been pulled!
For the truncated tweets I tried tweet_mode='extended', but that is a v1.1 parameter and, as far as I can tell, it is not compatible with v2.
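The only v2 equivalent I've found mentioned is the note_tweet field, which is supposed to carry the full text of long-form tweets while the text field stays truncated with the '…'. This is the untested direction I mean; I read it from the raw tweet.data dict since I'm not sure my Tweepy version exposes it as an attribute:

# Untested sketch: ask for the long-form text and fall back to the normal text
response = client.search_recent_tweets(
    query=query,
    max_results=100,
    tweet_fields=['id', 'text', 'created_at', 'public_metrics', 'note_tweet']
)
for tweet in response.data or []:
    note = tweet.data.get('note_tweet')  # only present on tweets longer than 280 characters
    full_text = note['text'] if note else tweet['text']

Here is the pagination attempt that hit the rate limit and burned through my quota: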
# Perform the search query for keyword and pagination
query = f'{keywords} lang:en -is:retweet -is:quote -has:media'
next_token = None
while True:
    response = client.search_recent_tweets(
        query=query,
        max_results=35,
        start_time=start_date,
        end_time=end_date,
        tweet_fields=['id', 'text', 'created_at', 'public_metrics', 'geo'],
        expansions=['geo.place_id'],
        next_token=next_token
    )
    for tweet in response.data or []:  # response.data is None when a page comes back empty
        tweet_data = {
            'Tweet ID': tweet['id'],
            'Text': tweet['text'].encode('utf-8', 'ignore').decode('utf-8'),
            'Public metrics': {
                'retweet_count': tweet['public_metrics']['retweet_count'],
                'reply_count': tweet['public_metrics']['reply_count'],
                'like_count': tweet['public_metrics']['like_count']
            },
            'Created At': tweet['created_at'],
            'Place': tweet['geo']
        }
        tweets_data.append(tweet_data)
    # Move to the next page, or stop when there are no more pages
    if 'next_token' in response.meta:
        next_token = response.meta['next_token']
    else:
        break
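For the first question, I've since come across tweepy.Paginator in the Tweepy docs, which seems to handle next_token automatically and can cap the total. Would something like this untested sketch be the right way to stop at 5,000 tweets without burning through my quota again?

# Untested sketch: let tweepy.Paginator manage next_token and cap the total at 5,000
paginator = tweepy.Paginator(
    client.search_recent_tweets,
    query=query,
    start_time=start_date,
    end_time=end_date,
    tweet_fields=['id', 'text', 'created_at', 'public_metrics', 'geo'],
    expansions=['geo.place_id'],
    max_results=100  # per-request maximum, so the 5,000 costs as few requests as possible
)
for tweet in paginator.flatten(limit=5000):  # stops iterating after 5,000 tweets total
    tweets_data.append({
        'Tweet ID': tweet['id'],
        'Text': tweet['text'],
        'Created At': tweet['created_at']
    })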