Web crawler for Facebook in Python

Question

I am trying to write a web crawler in Python that prints the number of Facebook recommends for a page. For example, this Sky News article (http://news.sky.com/story/1330046/are-putins-little-green-men-back-in-ukraine) has about 60 Facebook recommends. I want my Python program to print that number. I tried the following, but it doesn't print anything:

import requests
from bs4 import BeautifulSoup

def get_single_item_data(item_url):
    source_code = requests.get(item_url)
    plain_text = source_code.text
    # name the parser explicitly so the result does not depend on what is installed
    soup = BeautifulSoup(plain_text, 'html.parser')
    # look for the span that should hold the recommend count
    for item_name in soup.findAll('span', {'class': 'pluginCountTextDisconnected'}):
        try:
            print(item_name.string)
        except Exception:
            print("error")

get_single_item_data("http://news.sky.com/story/1330046/are-putins-little-green-men-back-in-ukraine")

If it prints nothing, then either all `item_name.string`s are `''`, or `soup.findAll` returns empty. So why don't you try a simple debug like `found = soup.findAll(...); print(found)`? — OJFord, Sep 04 '14 at 22:44
If it doesn't print anything, obviously the `for` loop is executing 0 times, which means `soup.findAll` isn't returning anything, which means there are no `<span>` elements with that class. So… looking at the `soup`, what makes you think such elements exist? Can you post a stripped-down example of an HTML document that you think should work with this code, but doesn't? (See [MCVE](http://stackoverflow.com/help/mcve).) — abarnert, Sep 04 '14 at 22:44
Also, it worries me that you're using `findAll`, which was an "effectively deprecated" name in the late BS 3.x days, and is now a "legacy" name. This implies that you're copying and pasting some really ancient code (or following a very out-of-date tutorial), and if so, there are likely going to be a lot of problems. — abarnert, Sep 04 '14 at 22:46
@abarnert, I am following thenewboston's series, which was released 2 days ago — Yagel, Sep 04 '14 at 23:09
@Yagel any particular reason you can't just use the FB API? (Been a while since I've used it, but I think it had a way of getting likes related to external sites) — Jon Clements, Sep 04 '14 at 23:29
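
The debugging step suggested in the comments can be sketched as follows. This is only an illustration: `sample_html` is a made-up stand-in for the article's HTML, assuming the page itself contains an iframe but no matching span.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the article page: the count is not in this HTML,
# only an iframe that points at the Facebook plugin.
sample_html = '<div><iframe src="//www.facebook.com/plugins/like.php"></iframe></div>'

soup = BeautifulSoup(sample_html, 'html.parser')

# Inspect the search result before looping over it.
found = soup.find_all('span', {'class': 'pluginCountTextDisconnected'})
print(found)       # an empty result means the for loop body never runs
print(len(found))
```

If the result is empty for the real page as well, the span is not part of the HTML that `requests` downloaded.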

Celeo · Answer 1 · 2014-09-05T00:26:40.630

3

The Facebook recommend count loads in an iframe. You can follow the iframe's src attribute to that page and then read the text of span.pluginCountTextDisconnected:

import requests
from bs4 import BeautifulSoup

url = 'http://news.sky.com/story/1330046/are-putins-little-green-men-back-in-ukraine'
r = requests.get(url) # get the page through requests
soup = BeautifulSoup(r.text, 'html.parser') # create a BeautifulSoup object from the page's HTML

url = soup('iframe')[0]['src'] # search for the first iframe element and get its src attribute
r = requests.get('http://' + url[2:]) # get the next page from requests with the iframe URL
soup = BeautifulSoup(r.text, 'html.parser') # create another BeautifulSoup object

print(soup.find('span', class_='pluginCountTextDisconnected').string) # print the recommend count

The second requests.get is written that way because the src attribute is a protocol-relative URL: //www.facebook.com/plugins/like.php?href=http%3A%2F%2Fnews.sky.com%2Fstory%2F1330046&send=false&layout=button_count&width=120&show_faces=false&action=recommend&colorscheme=light&font=arial&height=21. I prepended http:// and dropped the leading //.
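
As an alternative to slicing the string by hand, the standard library's `urljoin` can resolve a protocol-relative src against the page URL. This is a minimal sketch, not part of the original answer; `iframe_src` is shortened here for illustration.

```python
from urllib.parse import urljoin

page_url = 'http://news.sky.com/story/1330046/are-putins-little-green-men-back-in-ukraine'

# Shortened, illustrative version of the iframe's src attribute.
iframe_src = '//www.facebook.com/plugins/like.php?href=http%3A%2F%2Fnews.sky.com%2Fstory%2F1330046'

# A URL starting with // inherits the scheme (http or https) of the page it came from.
iframe_url = urljoin(page_url, iframe_src)
print(iframe_url)
```

The same call also works if the src is already an absolute URL or a path relative to the page, so it does not depend on the leading // being present.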

BeautifulSoup documentation
Requests documentation

edited Sep 05 '14 at 00:26

answered Sep 04 '14 at 23:00

I don't really understand these 3 lines of code and what they do: 1. BeautifulSoup(r.text) 2. soup('iframe')[0]['src'] 3. requests.get('http://' + url[2:]). I only started learning Python yesterday, thank you – Yagel Sep 04 '14 at 23:38
@Yagel I added comments to several of the lines and two links for you to use. – Celeo Sep 05 '14 at 00:27
(I hope you will see this post...) I understand your answer, but when I tried to use it on this website, for example (http://tech.walla.co.il/?w=/4028/2782391), it doesn't work. I tried to change the code, but it still didn't work. Thanks – Yagel Sep 09 '14 at 00:28
On that website, there are multiple `iframe` elements. You'll need to modify your code to search for the `iframe` with a `src` attribute that contains `facebook` or some other proper identifier. – Celeo Sep 09 '14 at 02:46
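
A minimal sketch of that suggestion, assuming the Facebook plugin's iframe can be recognised by the substring `facebook` in its src. The function name `find_facebook_iframe_src` and the sample HTML are hypothetical, made up for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical page with several iframes, one of them without a src attribute.
sample_html = '''
<iframe src="//ads.example.com/banner"></iframe>
<iframe></iframe>
<iframe src="//www.facebook.com/plugins/like.php?href=http%3A%2F%2Fexample.com"></iframe>
'''

def find_facebook_iframe_src(html):
    """Return the src of the first iframe that mentions facebook, or None."""
    soup = BeautifulSoup(html, 'html.parser')
    for iframe in soup.find_all('iframe'):
        src = iframe.get('src', '')  # some iframes have no src at all
        if 'facebook' in src:
            return src
    return None

print(find_facebook_iframe_src(sample_html))
```

Returning None when nothing matches lets the caller decide what to do, instead of failing with an IndexError the way `soup('iframe')[0]` does on a page without iframes.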

score 2 · Answer 2 · answered Sep 04 '14 at 22:45

Facebook recommends are loaded dynamically by JavaScript, so they won't be available to your HTML parser. You will need to use the Graph API and FQL to get your answer directly from Facebook.

Here is a web console where you can explore queries once you have generated an access token.
