I'm trying to work around the fact that snscrape does not support gathering tweets evenly throughout the day, but I'm running into issues with the data I get back. I want to collect tweets mentioning stock tickers from the S&P 500; for testing I'm currently using AAPL and MSFT.
This is my code:
from datetime import datetime, timedelta
import snscrape.modules.twitter as sntwitter
import pandas as pd

# Creating list to append tweet data to
Tweets = []
tickers = ['AAPL','MSFT']
timeperiod = (datetime.strptime('2022-09-01', '%Y-%m-%d') -
              datetime.strptime('2021-06-01', '%Y-%m-%d')).days * 24
startime = datetime.now()
start_time = 1622498400
end_time = 1622502000
for s in tickers:
    for t in range(240):
        try:
            for i, tweet in enumerate(sntwitter.TwitterSearchScraper(
                    f'{s} since_time:{start_time} until_time:{end_time}').get_items()):
                if i > 60:
                    break
                Tweets.append({'Date': tweet.date, 'Text': tweet.content, 'Ticker': s})
        except RuntimeError:
            print('Error occurred')
        end_time = datetime.strptime(f'{datetime.fromtimestamp(end_time)}', '%Y-%m-%d %H:%M:%S') \
            + timedelta(hours=t)
        start_time = end_time - timedelta(hours=1)
        start_time = start_time.timestamp()
        end_time = end_time.timestamp()

# Creating a dataframe to load the list
tweets_df = pd.DataFrame(Tweets, columns=['Date', 'Text', 'Ticker'])
tweets_df.to_csv('sampleTwitter.csv', encoding='UTF-8')
runtime = datetime.now() - startime
print(runtime)
The problem shows up once the code has finished and I look at the CSV: I only get tweets from the first hours of the starting day. The inner loop should break after collecting 60 tweets within the specified hour, then move on to the next hour, and so on. I want to run this over a longer time period, so for testing I currently use 10 days, which equals 240 hours to loop through.
Since since_time and until_time accept epoch time, my idea is to update the epoch values with the hour I want to scrape. My logic is that since_time always equals until_time - 1 hour, and until_time equals the initial end_time + t, where t is the number of hours from the initial end_time. To my understanding, which is limited in Python, the code does not properly collect and store the tweets: I get roughly 30 tweets when I should be getting 60 x 24 x 10 = 14,400 tweets (given that there are enough tweets matching the query within each hour).
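To make the intended windowing concrete, here is a minimal sketch that derives every one-hour window directly from the fixed starting epoch rather than updating start_time and end_time in place. The helper name hourly_windows is hypothetical, and the values are kept as integers on the assumption that since_time and until_time expect whole epoch seconds.

```python
from typing import Iterator, Tuple

SECONDS_PER_HOUR = 3600


def hourly_windows(base_start: int, hours: int) -> Iterator[Tuple[int, int]]:
    """Yield (since_time, until_time) epoch pairs, one per hour.

    Each window is computed from base_start, so there is no running
    state that can drift from one iteration to the next.
    """
    for t in range(hours):
        since = base_start + t * SECONDS_PER_HOUR
        until = since + SECONDS_PER_HOUR
        yield since, until


# 10 days of hourly windows, starting at the epoch used in the question.
windows = list(hourly_windows(1622498400, 240))
print(len(windows))  # 240
print(windows[0])    # (1622498400, 1622502000)
print(windows[1])    # (1622502000, 1622505600)

# Upper bound on collected tweets: 60 per hour, 24 hours, 10 days.
max_tweets = 60 * 24 * 10
print(max_tweets)    # 14400
```

Because each window starts exactly where the previous one ends, the 240 windows cover the 10 days without gaps or overlaps.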
The expected output is something like this:
Date                       Text    Ticker
2021-05-31 22:57:17+00:00  sample  AAPL
2021-05-31 22:47:27+00:00  sample  AAPL
2021-05-31 21:47:27+00:00  sample  AAPL
2021-05-31 20:47:27+00:00  sample  AAPL
continuing for 10 days.
But the current output is this:
Date                       Text    Ticker
2021-05-31 22:57:17+00:00  sample  AAPL
2021-05-31 22:47:27+00:00  sample  AAPL
2021-05-31 22:45:27+00:00  sample  AAPL
2021-05-31 22:44:27+00:00  sample  AAPL
only the last hours of the first day.
EDIT: Fixed the fault by creating a function and removing parts of the code.
def convertTime(var_time,var):
    time = datetime.strptime(f'{datetime.fromtimestamp(var_time)}', '%Y-%m-%d %H:%M:%S')+timedelta(hours=1)*var
    time = int(time.timestamp())
    return time
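Note that convertTime round-trips through a naive local datetime, so its result may be off by an hour around a daylight-saving change on the machine running it. As a hedged alternative, the sketch below performs the same shift with a timezone-aware UTC datetime. The name convert_time_utc is hypothetical and not part of the original code.

```python
from datetime import datetime, timedelta, timezone


def convert_time_utc(epoch_seconds: int, hours: int) -> int:
    """Shift an epoch timestamp forward by a whole number of hours."""
    # Interpret the epoch as an aware UTC datetime so that local
    # timezone rules never enter the calculation.
    moment = datetime.fromtimestamp(epoch_seconds, tz=timezone.utc)
    shifted = moment + timedelta(hours=hours)
    return int(shifted.timestamp())


print(convert_time_utc(1622498400, 0))  # 1622498400
print(convert_time_utc(1622498400, 1))  # 1622502000
# The starting epoch from the question, shown as a UTC timestamp.
print(datetime.fromtimestamp(1622498400, tz=timezone.utc).isoformat())
```

The result is always an int, so the value can be dropped straight into the f-string that builds the query.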
If anyone is interested, this code does what I want to achieve.
Tweets = []
start_time = 1622498400
end_time = 1622502000
for t in range(72):
    for i, tweet in enumerate(sntwitter.TwitterSearchScraper(
            f'your-query since_time:{convertTime(start_time, t)} until_time:{convertTime(end_time, t)}').get_items()):
        # Stop once 60 tweets have been collected for this hour
        if i >= 60:
            break
        Tweets.append({'Date': tweet.date, 'Text': tweet.content, 'Ticker': 'your-iterator'})
tweets_df = pd.DataFrame(Tweets, columns=['Date', 'Text', 'Ticker'])
tweets_df.to_csv('sampleTwitter.csv', encoding='UTF-8')
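To extend the loop above to several tickers, one possible arrangement is sketched below. It is only an illustration under stated assumptions: collect_hourly, fake_search and FakeTweet are hypothetical names, and the scraper is passed in as a callable so the example runs without network access. With snscrape installed, the callable could be lambda q: sntwitter.TwitterSearchScraper(q).get_items(), as in the code above.

```python
from collections import namedtuple
from itertools import islice
from typing import Callable, Dict, Iterable, List


def collect_hourly(
    search: Callable[[str], Iterable],
    tickers: List[str],
    base_start: int,
    hours: int,
    per_hour: int = 60,
) -> List[Dict]:
    """Collect up to per_hour items for every ticker and hourly window."""
    rows = []
    for ticker in tickers:
        for t in range(hours):
            # Derive each window from the fixed base epoch (integers only).
            since = base_start + t * 3600
            until = since + 3600
            query = f'{ticker} since_time:{since} until_time:{until}'
            # islice stops after per_hour items without a manual counter.
            for tweet in islice(search(query), per_hour):
                rows.append({'Date': tweet.date, 'Text': tweet.content, 'Ticker': ticker})
    return rows


# Stand-in for the real scraper: yields 100 fake tweets per query and
# stores the query in the date field so the windows can be inspected.
FakeTweet = namedtuple('FakeTweet', ['date', 'content'])


def fake_search(query: str) -> Iterable:
    return (FakeTweet(date=query, content=f'sample {n}') for n in range(100))


rows = collect_hourly(fake_search, ['AAPL', 'MSFT'], 1622498400, 3)
print(len(rows))  # 360 = 2 tickers * 3 hours * 60 items
```

The resulting list of dictionaries has the same keys as Tweets, so it can be handed to pd.DataFrame in the same way.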