Thanks to @BittoBennichan, I have been able to build this little Python script that scrapes the user IDs of accounts tagged in media posted on Twitter:
from bs4 import BeautifulSoup
from selenium import webdriver
import time

# Create a new instance of the Firefox driver
driver = webdriver.Firefox()

# Go to the media page
driver.get("http://twitter.com/XXXXXX/media")

# You can adjust this, but 2 seconds works fine
SCROLL_PAUSE_TIME = 2

# Get the initial scroll height
last_height = driver.execute_script("return document.body.scrollHeight")

while True:
    # Scroll down to the bottom
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")

    # Wait for the page to load
    time.sleep(SCROLL_PAUSE_TIME)

    # Calculate the new scroll height and compare it with the last one;
    # if nothing new loaded, we have reached the end
    new_height = driver.execute_script("return document.body.scrollHeight")
    if new_height == last_height:
        break
    last_height = new_height

# Now that the page is fully scrolled, grab the source code.
src = driver.page_source

# Parse it with BeautifulSoup
soup = BeautifulSoup(src, 'html.parser')
divs = soup.find_all('div', class_='account')

# PRINT RESULTS
# print('printing results')
# for div in divs:
#     print(div['data-user-id'])

# SAVE TO FILE
print('Saving results')
with open('file.txt', 'w') as f:
    for div in divs:
        f.write(div['data-user-id'] + '\n')
So the program works fine: it retrieves the IDs and prints them or writes them to a .txt file. I can then paste this list of IDs into Calc and add a pivot table to see how many times each ID was tagged (the same tally can also be done directly in Python; see the sketch just below).
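As an aside, here is a minimal sketch of that tally done with collections.Counter instead of Calc, reading back the file written above:

from collections import Counter

# Count how many times each ID appears in the file written by the scraper
with open('file.txt') as f:
    counts = Counter(line.strip() for line in f if line.strip())

# Print IDs from most to least tagged
for user_id, n in counts.most_common():
    print(user_id, n)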
But I still have some problems:
- I only get the IDs, not the usernames. What would be simpler: collecting the usernames at the same time as the IDs and putting them together in the file (see the sketch after this list), or converting the IDs file into a username file later? And how would that last solution work?
- I can't scroll down infinitely. I got back to September 2018, but that's it; the page just says "Back to top". Is that because I'm not logged into Twitter, or because of some built-in limitation?
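For the first problem, grabbing both pieces of data in the same pass seems simplest. Here is a sketch under the assumption that the same 'account' div also exposes a data-screen-name attribute (that attribute name is my guess, so verify it against the actual page source first):

# Variant of the save loop above: write the ID and the screen name side by side.
# ASSUMPTION: the 'account' div also carries a 'data-screen-name' attribute;
# .get() avoids a KeyError if it does not.
with open('file.txt', 'w') as f:
    for div in divs:
        user_id = div['data-user-id']
        screen_name = div.get('data-screen-name', '')
        f.write(user_id + ',' + screen_name + '\n')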
If you have any input, ideas, etc., any help would be appreciated. Thanks!
EDIT 1: I have found this Tweepy solution from here:
def get_usernames(ids):
    """Can only do the lookup in steps of 100,
    so 'ids' should be a list of at most 100 IDs.
    """
    # 'api' must be an already-authenticated tweepy.API instance
    user_objs = api.lookup_users(user_ids=ids)
    for user in user_objs:
        print(user.screen_name)
So, as my list is longer than 100 IDs, I should do this:
For a larger set of IDs, you can just put this in a for loop and call it accordingly while obeying the Twitter API rate limit.
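A sketch of that loop, assuming api is an already-authenticated tweepy.API instance (the chunk size of 100 matches the lookup_users limit, and the sleep is a crude way to stay under the rate limit):

import time

def get_all_usernames(ids, api, pause=1.0):
    """Look up screen names for any number of IDs, 100 per API call."""
    names = []
    # lookup_users accepts at most 100 user IDs per call, so slice in chunks
    for i in range(0, len(ids), 100):
        chunk = ids[i:i + 100]
        for user in api.lookup_users(user_ids=chunk):
            names.append(user.screen_name)
        time.sleep(pause)  # crude pause between calls to respect the rate limit
    return names

Fed the IDs read back from file.txt, this should return the matching screen names; as far as I know, the endpoint silently skips suspended or deleted accounts, so the output can be shorter than the input.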