
I am trying to retrieve a total of 5000 tweets using the Basic access API. I have been getting fewer than the 100 tweets I ask for per request because of the duplicate check in my code. I want to use the next_token parameter, but I don't know how to implement it in this code so that the API doesn't look through the same set of tweets each time, wasting my requests.

Secondly, I want to extract tweets using both keywords (which I've done) and the user's self-defined location, e.g. America and the United Kingdom. How do I add a profile-location filter to this search? (To clarify, I don't want the tweet's geo-location but the location in the user's bio.)
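
As far as I know, profile-location query operators (e.g. profile_country) are restricted to Enterprise access, so at the Basic level the usual workaround is to request the author objects alongside the tweets (expansions=['author_id'], user_fields=['location']) and filter client-side on the free-text bio location. A minimal sketch of that filtering step as a hypothetical helper over already-fetched data (the field names mirror the v2 payload; the function name is mine):

```python
def filter_by_profile_location(tweets, users_by_id, allowed_terms):
    """Keep tweets whose author's free-text bio location mentions any allowed term.

    tweets: list of dicts with an 'author_id' key
    users_by_id: dict mapping author_id -> user dict with an optional 'location'
    allowed_terms: substrings to match case-insensitively, e.g. ['United Kingdom']
    """
    allowed = [term.lower() for term in allowed_terms]
    kept = []
    for tweet in tweets:
        user = users_by_id.get(tweet['author_id'])
        # Bio location is optional free text; treat a missing one as empty.
        location = (user or {}).get('location') or ''
        if any(term in location.lower() for term in allowed):
            kept.append(tweet)
    return kept
```

With the real API you would build users_by_id from response.includes['users'] after requesting the author_id expansion. Note the bio is free text ("London", "UK", "the moon"), so substring matching is approximate by nature.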

Lastly, I have been getting truncated tweets in my data, where a tweet is cut short and followed by the symbol '…'.

I would appreciate any help with this. Thank you.

import tweepy
import pandas as pd
import config1

client = tweepy.Client(bearer_token=config1.bearer_token, wait_on_rate_limit=True)

# Define the query parameters
keywords = 'COVID or women'
start_date = '2023-07-22T02:00:00Z'
end_date = '2023-07-22T18:00:00Z'

# Create a list to store the extracted data
tweets_data = []

# Perform the search query for keyword
query = f'{keywords} lang:en -is:retweet -is:quote -has:media'

response = client.search_recent_tweets(query=query, max_results=100, start_time=start_date, end_time=end_date, tweet_fields=['id', 'text', 'created_at', 'public_metrics', 'geo'], expansions=['geo.place_id'])

# Extract the desired information from each tweet
for tweet in response.data or []:  # guard against a page with no results
    tweet_data = {
        'Tweet ID': tweet['id'],
        'Text': tweet['text'].encode('utf-8', 'ignore').decode('utf-8'),
        'Public metrics': {
            'retweet_count': tweet['public_metrics']['retweet_count'],
            'reply_count': tweet['public_metrics']['reply_count'],
            'like_count': tweet['public_metrics']['like_count']
        },
        'Created At': tweet['created_at'],
        'Place': tweet.geo
    }

    # Add the tweet data to the list
    tweets_data.append(tweet_data)

# Create a DataFrame from the extracted tweet data
df = pd.DataFrame(tweets_data)

# Load the existing CSV file
existing_df = pd.read_csv('tweets_ex.csv', encoding='utf-8')

# Concatenate the existing DataFrame and the new DataFrame
updated_df = pd.concat([existing_df, df], ignore_index=True)

# Drop duplicate tweets based on the 'Text' column
updated_df.drop_duplicates(subset='Text', inplace=True)

# Save the updated DataFrame to the CSV file
updated_df.to_csv('tweets_ex.csv', encoding='utf-8', index=False)

# Print the updated number of rows
print(f"The updated number of rows in the CSV file is: {len(updated_df)}")
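
To keep the pagination logic easy to reason about (and test), here is a minimal sketch that separates the next_token handling from the network call. fetch_page is a stand-in for client.search_recent_tweets: any callable returning an object with .data and .meta. The point is that the token from each response is carried into the next request, so every page starts where the previous one ended instead of re-reading the same tweets:

```python
def collect_tweets(fetch_page, target=5000, page_size=100):
    """Page through results until `target` tweets are collected or pages run out."""
    collected = []
    next_token = None
    while len(collected) < target:
        response = fetch_page(max_results=page_size, next_token=next_token)
        if response.data:
            collected.extend(response.data)
        # meta carries the cursor for the following page, if one exists
        next_token = response.meta.get('next_token')
        if next_token is None:
            break
    return collected
```

In real code you would wrap the client call, e.g. fetch_page = lambda **kw: client.search_recent_tweets(query=query, start_time=start_date, end_time=end_date, tweet_fields=[...], **kw). tweepy also ships tweepy.Paginator, which does this token handling for you; the hand-rolled loop above just makes explicit what it does.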

When I tried the next_token it said rate limit exceeded and had an 800+ second sleep, which I eventually had to interrupt. This ended up not returning any tweets, but it still used up my extraction quota: where I had only asked for 35 tweets, my dev portal showed that 2,100 tweets had been pulled!

For the truncated tweets I tried tweet_mode='extended', but I don't think that is compatible with v2.
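
Right, tweet_mode='extended' is a v1.1 parameter. If the '…' marks posts longer than 280 characters, the v2 API serves the full body in a separate note_tweet field: add 'note_tweet' to tweet_fields (this needs a reasonably recent tweepy 4.x). A sketch of a fallback helper over the raw tweet payload, falling back to the ordinary text when there is no long-form body:

```python
def full_text(tweet):
    """Prefer the long-form note_tweet body when present, else the text field."""
    note = tweet.get('note_tweet') or {}
    return note.get('text') or tweet['text']
```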

# Perform the search query for keyword and pagination
next_token = None
while True:
    query = f'{keywords} lang:en -is:retweet -is:quote -has:media'
    response = client.search_recent_tweets(
        query=query,
        max_results=35,
        start_time=start_date,
        end_time=end_date,
        tweet_fields=['id', 'text', 'created_at', 'public_metrics', 'geo'],
        expansions=['geo.place_id'],
        next_token=next_token
    )
    
    for tweet in response.data or []:  # guard against a page with no results
        tweet_data = {
            'Tweet ID': tweet['id'],
            'Text': tweet['text'].encode('utf-8', 'ignore').decode('utf-8'),
            'Public metrics': {
                'retweet_count': tweet['public_metrics']['retweet_count'],
                'reply_count': tweet['public_metrics']['reply_count'],
                'like_count': tweet['public_metrics']['like_count']
            },
            'Created At': tweet['created_at'],
            'Place': tweet.geo
        }
        tweets_data.append(tweet_data)

    if 'next_token' in response.meta:
        next_token = response.meta['next_token']
    else:
        break
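
On the quota point: as I understand it, the usage meter counts every tweet the endpoint returns, so a loop that keeps re-requesting the same window without carrying the token forward re-bills the same tweets on every pass (35 per request adding up to 2,100). Deduplicating only at CSV time doesn't help the quota, but deduplicating in memory as pages arrive at least keeps the collected data clean. A small sketch of that, tracking seen IDs across requests (the function name is mine):

```python
def dedupe_by_id(seen_ids, new_tweets):
    """Return only tweets whose ID has not been seen before; updates seen_ids in place."""
    fresh = []
    for tweet in new_tweets:
        if tweet['id'] not in seen_ids:
            seen_ids.add(tweet['id'])
            fresh.append(tweet)
    return fresh
```

seen_ids could be pre-seeded from the 'Tweet ID' column of the existing CSV so that reruns never re-append old tweets.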
  • This: "it said rate limit exceeded" doesn't mean try again with a smaller number, it means that you've done too much already - now you need to wait. – thebjorn Jul 23 '23 at 20:37
  • @thebjorn yes that’s what I understand it to mean. But this was my first request in 3 days and I had asked for 35 tweets. Is it normal to have that message with next_token ? – Cinnamon Onyx Jul 23 '23 at 22:28

0 Answers