0

I am learning web scraping with Python, but I can't get the desired result. Below are my code and its output.

code

import bs4,requests
url = "https://twitter.com/24x7chess"
r = requests.get(url)
soup = bs4.BeautifulSoup(r.text,"html.parser")
soup.find_all("span",{"class":"account-group-inner"})
[]

Here is what I was trying to scrape

https://i.stack.imgur.com/tHo5S.png

I keep getting an empty list. Please help.

  • Why are you not using the official Twitter API? Web scraping is not ideal for Twitter. – Saharsh Oct 21 '17 at 07:45
  • Actually, I have just started with this, which is why I am taking a more comprehensive path rather than focusing only on the Twitter API. –  Oct 21 '17 at 07:51

3 Answers

2

Sites like Twitter load their content dynamically, and what gets loaded sometimes depends on the browser you are using. Because of this, some elements of the page are lazily loaded: the DOM is inflated dynamically, depending on user actions. The tag you see in your browser's Inspect Element view belongs to the fully inflated HTML. The response you get from requests, however, is the uninflated HTML: a simple DOM still waiting to load its elements dynamically in response to user actions. The element you are looking for is not in that response, which is why your search finds nothing.

I would suggest using Selenium WebDriver for scraping dynamic JavaScript web pages.
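
The difference between the static response and the rendered DOM can be illustrated without any network access. The sketch below does not use Twitter's real markup: the HTML string, the `profile` id, and the script are made up for illustration. It parses a page whose `span` would only be created by JavaScript in a browser. BeautifulSoup does not execute scripts, so the search comes back empty, much like in the question.

```python
from bs4 import BeautifulSoup

# Hypothetical page source, as a plain HTTP client would receive it.
# In a browser the script would add the span; here it never runs.
static_html = """
<html>
  <body>
    <div id="profile"></div>
    <script>
      var span = document.createElement("span");
      span.className = "account-group-inner";
      document.getElementById("profile").appendChild(span);
    </script>
  </body>
</html>
"""

soup = BeautifulSoup(static_html, "html.parser")

# BeautifulSoup only parses markup; it does not execute JavaScript,
# so the span created by the script is never part of the tree.
spans = soup.find_all("span", {"class": "account-group-inner"})
print(spans)

# The container div is in the static markup, so it can be found.
container = soup.find("div", {"id": "profile"})
print(container is not None)
```

A browser driven by Selenium would run the script first, so the same search on `driver.page_source` would be expected to find the element.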

ZdaR
  • Hi. Thanks for taking the time. I have noticed that I can scrape only the data that appears in the page's view source, not the data I see when I inspect the website. Could you please look into this? –  Oct 21 '17 at 07:31
  • @akulchhillar With requests you can only fetch the static DOM. For your use case you need the [`selenium`](http://selenium-python.readthedocs.io/) module. – ZdaR Oct 21 '17 at 07:48
  • Thanks. I am learning Selenium these days. By the way, what if I use urllib for scraping dynamic websites? –  Oct 21 '17 at 07:50
  • To my knowledge, `selenium` is the only popular option as of now. `requests`, `urllib`, etc. are network libraries, mainly used to send requests to and read responses from endpoints such as REST APIs, so they were not designed to render dynamic JavaScript content. – ZdaR Oct 21 '17 at 08:05
1

Try this. It should give you the items you are probably looking for. Selenium combined with BeautifulSoup is easy to handle, so I've written it that way.

from bs4 import BeautifulSoup
from selenium import webdriver

# Let a real browser load the page so the JavaScript-rendered markup exists.
driver = webdriver.Chrome()

driver.get("https://twitter.com/24x7chess")
# Hand the rendered HTML to BeautifulSoup, then close the browser.
soup = BeautifulSoup(driver.page_source, "lxml")
driver.quit()

for title in soup.select("#page-container"):
    name = title.select(".ProfileHeaderCard-nameLink")[0].text.strip()
    location = title.select(".ProfileHeaderCard-locationText")[0].text.strip()
    # The profile counters share one class; they are picked out by position.
    tweets = title.select(".ProfileNav-value")[0].text.strip()
    following = title.select(".ProfileNav-value")[1].text.strip()
    followers = title.select(".ProfileNav-value")[2].text.strip()
    likes = title.select(".ProfileNav-value")[3].text.strip()
    print(name, location, tweets, following, followers, likes)

Output:

akul chhillar New Delhi, India 214 44 17 5
SIM
  • Thanks a lot. I have started using Selenium and it works like magic. –  Oct 21 '17 at 16:07
  • If it works, make sure to mark this as an answer. Thanks. – SIM Oct 21 '17 at 16:30
  • Can I also use the find_all method here instead of select? –  Oct 22 '17 at 02:24
  • Oh yeah, you can certainly use the `find_all` method as well. – SIM Oct 22 '17 at 02:50
  • How can I retrieve data from a tag? For example, say I have an anchor tag with a class name and an href with some value. I can target this anchor tag by its class name, but now I want to retrieve the value of the href and store it. How can this be done? –  Oct 22 '17 at 11:51
  • Open a new thread describing your requirement and drop a link to it here. Thanks. – SIM Oct 22 '17 at 11:53
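
The two questions raised in the comments above (using `find_all` instead of `select`, and reading an `href`) can be sketched on a small inline document. The markup, the `profile-link` class name, and the URL are hypothetical, chosen only for illustration. The calls themselves (`select`, `find_all`, `find`, and dictionary-style attribute access on a tag) are standard BeautifulSoup.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: an anchor with a class name and an href.
html = '<div><a class="profile-link" href="https://example.com/user">User</a></div>'
soup = BeautifulSoup(html, "html.parser")

# The CSS-selector style and the find_all style locate the same tag.
by_select = soup.select("a.profile-link")
by_find_all = soup.find_all("a", class_="profile-link")
same_tag = by_select[0] == by_find_all[0]

# Attributes are read from a tag like dictionary entries.
link = soup.find("a", class_="profile-link")
href = link["href"]          # raises KeyError if the attribute is missing
missing = link.get("title")  # returns None if the attribute is missing

print(same_tag, href, missing)
```

Using `link.get(...)` is the safer choice when an attribute may be absent on some tags.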
0

You could have done the whole thing with requests rather than Selenium:

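import re

import requests
from bs4 import BeautifulSoup as bs

r = requests.get('https://twitter.com/24x7chess')
soup = bs(r.content, 'lxml')
# The bio is read from the description meta tag; newlines are collapsed to spaces.
bio = re.sub(r'\n+', ' ', soup.select_one('[name=description]')['content'])
stats_headers = ['Tweets', 'Following', 'Followers', 'Likes']
# Every element carrying a data-count attribute holds one profile counter.
stats = [item['data-count'] for item in soup.select('[data-count]')]
# Pair the headers with the counters by position.
data = dict(zip(stats_headers, stats))

print(bio, data)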
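
To see what the selectors above are doing, here is a minimal offline sketch. The HTML is a made-up stand-in, not Twitter's actual markup, and the numbers are placeholders. It only assumes, as the code above does, that the bio sits in a `meta` tag named `description` and that the counters carry a `data-count` attribute. It uses `html.parser` rather than `lxml` purely to avoid an extra dependency.

```python
import re

from bs4 import BeautifulSoup

# Hypothetical static markup standing in for the fetched page.
sample_html = """
<html>
  <head>
    <meta name="description" content="First line

Second line">
  </head>
  <body>
    <span data-count="10">10</span>
    <span data-count="20">20</span>
    <span data-count="30">30</span>
    <span data-count="40">40</span>
  </body>
</html>
"""

soup = BeautifulSoup(sample_html, "html.parser")

# [name=description] matches any tag whose name attribute is "description".
bio = re.sub(r"\n+", " ", soup.select_one("[name=description]")["content"])

# [data-count] matches any tag that has a data-count attribute at all.
stats = [item["data-count"] for item in soup.select("[data-count]")]

# zip pairs headers and values by position, so the order in the page matters.
stats_headers = ["Tweets", "Following", "Followers", "Likes"]
data = dict(zip(stats_headers, stats))

print(bio, data)
```

Because the pairing is positional, a page with extra `data-count` elements or a different order would produce wrong labels, so it is worth checking the result against the page.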

QHarr