
Thanks to @BittoBennichan, I have been able to build this little Python thingy that scrapes the user ids tagged in media posted on Twitter:

from bs4 import BeautifulSoup
from selenium import webdriver
import time

# Create a new instance of the Firefox driver
driver = webdriver.Firefox()

# go to page
driver.get("http://twitter.com/XXXXXX/media")

# You can adjust it, but this works fine
SCROLL_PAUSE_TIME = 2

# Get scroll height
last_height = driver.execute_script("return document.body.scrollHeight")

while True:
    # Scroll down to bottom
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")

    # Wait to load page
    time.sleep(SCROLL_PAUSE_TIME)

    # Calculate new scroll height and compare with last scroll height
    new_height = driver.execute_script("return document.body.scrollHeight")
    if new_height == last_height:
        break
    last_height = new_height


# Now that the page is fully scrolled, grab the source code.
src = driver.page_source

# Parse it with BeautifulSoup
soup = BeautifulSoup(src, 'html.parser')
divs = soup.find_all('div',class_='account')

#PRINT RESULT
#print('printing results')
#for div in divs:
#    print(div['data-user-id'])


#SAVE IN FILE
print('Saving results')
with open('file.txt', 'w') as f:
    for div in divs:
        f.write(div['data-user-id'] + '\n')

So the program works fine: it retrieves the ids and prints them or writes them into a txt file. I can then paste this list of ids into Calc and add a pivot table to see how many times each id was tagged. BUT! I still have some problems:

-I only get the ids, not the usernames. Now, which would be simpler: collecting the usernames at the same time as the ids and putting them together in the file? Or converting the ids file into a username file later? And how would that last solution be possible?

-I can't scroll down infinitely. I got back to September 2018 but that's it. It just says "Back to top". Now, is it because I'm not logged into Twitter or because of some built-in limitation?
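(Aside on the counting step above: the Calc pivot-table tally can also be done directly in Python; a minimal sketch, assuming the file.txt the script writes:)

from collections import Counter

# Tally how many times each id appears in the scraped file
with open('file.txt') as f:
    counts = Counter(line.strip() for line in f)

# Most frequently tagged ids first
for user_id, n in counts.most_common():
    print(user_id, n)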

If you have any input, ideas, etc., any help would be appreciated. Thanks!

EDIT 1: I have found this Tweepy solution:

def get_usernames(ids):
    """ Can only do lookups in steps of 100,
        so 'ids' should be a list of at most 100 ids.
        Note: assumes 'api' is an authenticated tweepy.API instance.
    """
    user_objs = api.lookup_users(user_ids=ids)
    for user in user_objs:
        print(user.screen_name)

So, as my list is longer than 100, I should do this:

For a larger set of ids, you can just put this in a for loop and call it accordingly while obeying the Twitter API limit.
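A minimal sketch of that loop, staying with the Tweepy (v3) style used above; the credential strings are placeholders of mine and must be replaced with real app keys:

import tweepy

# Placeholder credentials -- replace with your own app's keys
auth = tweepy.OAuthHandler('CONSUMER_KEY', 'CONSUMER_SECRET')
auth.set_access_token('ACCESS_TOKEN', 'ACCESS_TOKEN_SECRET')
# wait_on_rate_limit makes tweepy sleep through rate limits instead of erroring
api = tweepy.API(auth, wait_on_rate_limit=True)

def get_usernames_batched(ids):
    """Look up screen names for any number of ids, 100 per request."""
    names = []
    for i in range(0, len(ids), 100):
        batch = ids[i:i + 100]
        names.extend(user.screen_name for user in api.lookup_users(user_ids=batch))
    return names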

1 Answer


Your code didn't generate ids for me, so I couldn't test these solutions initially. I'm not sure what the issue is, as I didn't look into it, but it seems my source html does not have any class='account'. So I altered that line to just say, "find all the div tags that have a data-user-id attribute":

import re

divs = soup.find_all('div', {"data-user-id": re.compile(r".*")})
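As a side note, BeautifulSoup can also match on attribute presence directly by passing True for the attribute value, which avoids the regex (and the re import):

divs = soup.find_all('div', attrs={"data-user-id": True})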

1) To have a csv, you can just write and save as a csv instead of txt. The other option is to create a dataframe with the ids and then use pandas to write it to a csv with df.to_csv('path/to/file.csv').
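A minimal sketch of that pandas option (the user_id column name is my own choice):

import pandas as pd

# The ids come out of BeautifulSoup as strings, which also keeps
# long ids from being mangled into numbers
df = pd.DataFrame({'user_id': [div['data-user-id'] for div in divs]})
df.to_csv('file.csv', index=False)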

2) Putting this into a list is quite an easy task as well.

Create List of IDs - for Loop

#TO PUT INTO LIST (FOR LOOP)
id_list = []
for div in divs:
    id_list.append(div['data-user-id'])

print(id_list)

Create List of IDs - List Comprehension

#TO PUT INTO LIST (LIST COMPREHENSION)
id_list = [ div['data-user-id'] for div in divs ]

Write to CSV

#SAVE IN FILE
import csv
print('Saving results')    
with open('file.csv','w', newline='') as f:
    writer = csv.writer(f)
    for div in divs:
        writer.writerow([div['data-user-id']])   
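And, anticipating the comment thread below: assuming the same divs also carry a data-screen-name attribute (as the asker later observed), both columns can be written in one pass; the header row and the empty-string fallback are my own additions:

import csv

with open('file.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['user_id', 'screen_name'])
    for div in divs:
        # .get() falls back to '' if a div lacks the attribute
        writer.writerow([div['data-user-id'], div.get('data-screen-name', '')])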
chitown88
  • Thanks a bunch, I'm gonna try that! I was also looking into a way to fetch the username at the same time that I'm fetching the data-user-id, and, when that's done, putting them in a csv with one column for ids and another for the usernames. – Max Baldwin Feb 22 '19 at 13:04
  • I had a quick look and do remember I saw username in there, so it shouldn't be too difficult to match those user ids to usernames. Then when you write the csv it's just a matter of writing both the user id and username to a row – chitown88 Feb 22 '19 at 13:10
  • Yes, I have to fetch the `data-screen-name` too, but I don't get how to do it with `find_all`. – Max Baldwin Feb 22 '19 at 14:31
  • And one last thing: with `divs = soup.find_all('div', {"data-user-id" : re.compile(r".*")})` I get a lot of duplicates at the beginning of the csv file. I think it's because there are multiple occurrences of `data-user-id` in the code. – Max Baldwin Feb 22 '19 at 14:46
  • Plus there is something weird. Some accounts have long user ids, like 1024596885661802496, but the program output is 1024596885661800000 in the csv file. Why would that be??? EDIT: If I open the csv file with Notepad the data is correct, but if I open it with Excel or Calc it's messed up. – Max Baldwin Feb 22 '19 at 14:48
  • The duplicates are easy to handle. You can either a) build your list/dataframe first and deduplicate it (e.g. with set) before writing the csv, or b) incorporate it in your writer loop by skipping any duplicates. – chitown88 Feb 22 '19 at 17:45
  • I'm guessing the issue with Excel is that it automatically treats the column as type number, when it should be text/string. – chitown88 Feb 22 '19 at 17:50
  • Lastly, with the usernames, I'll have to look at that later as I'm away from my laptop, but these are all good questions. You could post them as new questions, as you'll get more attention on them since this current post has already been answered. Also, be sure to accept the solution if it answers/helps your original post. Cheers! – chitown88 Feb 22 '19 at 17:52
  • Thank you for your replies! The problem is that I want to keep some duplicates so I can count them at some point and find out how many times user X was tagged. I realized too that Calc and Excel can't handle numbers that are too long, and I have to format the column as text when I open the csv file. Then, and it's my biggest problem, I have no idea how to search for `data-screen-name` at the same time that I search for `data-user-id`. I'd like to scrape both and put them in the csv (one column each) but so far I've found no answers :( – Max Baldwin Feb 22 '19 at 17:54
  • I totally understand. That is weird about Excel though; the length of a number shouldn't matter if it's string/text. Might just be an Excel thing. And like I said, I can have a look at getting usernames, but can't til later tonight/tomorrow. – chitown88 Feb 22 '19 at 17:59
  • Thanks a lot again for your patience and kindness. I'm gonna try to find the answer by myself in the meantime and will post it here if I succeed. – Max Baldwin Feb 22 '19 at 18:01