
In the Safari browser, I can right-click and select "Inspect Element", and a lot of code appears. Is it possible to get this code using Python? Ideally I would like to end up with a file containing that code.

More specifically, I am trying to find the links to the images on this page: http://500px.com/popular. I can see the links from "Inspect Element" and I would like to retrieve them with Python.
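For reference, a minimal sketch (Python 3, standard library only) of fetching a page's HTML and saving it to a file; note that the HTML the server sends can differ from what "Inspect Element" shows, because the inspector reflects the page after JavaScript has run. The output filename is just an example:

# Minimal sketch: download the served HTML and write it to a file.
# "popular.html" is an arbitrary example filename.
from urllib.request import urlopen

html = urlopen("http://500px.com/popular").read()
with open("popular.html", "wb") as out_file:
    out_file.write(html)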

  • The code you're seeing in Inspect Element is the page source code, i.e. the text that the browser gets and translates into a visual website. The inspect element of Chrome includes a lot of features for developers, but at its core it's still just source code. Of course you can get that information with Python - but your question is very general - are you trying to scrape an entire website, a single page, or do you want to get a specific part of it? What about stylesheets for that page? – yuvi Aug 10 '14 at 10:31
  • I am trying to find the links to the images on this page: http://500px.com/popular. And I can see the links from "Inspect Element", and retrieve them with Python. – EspenG Aug 10 '14 at 10:36
  • See? That's an entirely different question! Getting the source code for a webpage is a two-line job: `import urllib; urllib.urlopen("http://500px.com/popular").read();`. It's analysing that HTML that is usually the complicated part. I suggest you [learn a bit about scraping](http://en.wikipedia.org/wiki/Web_scraping), and start familiarizing yourself with Python scraping libraries (such as [BeautifulSoup](http://www.crummy.com/software/BeautifulSoup/)) – yuvi Aug 10 '14 at 10:41
  • Oh well, look at that, 500px has an API. So there's also a [ready-made python client](https://github.com/500px/PxMagic) if you're lazy. – yuvi Aug 10 '14 at 10:46
  • I have tried that, and the links are not there. So there is no easy way to just get the code from "Inspect Element", then? – EspenG Aug 10 '14 at 10:46
  • The API is the easiest way of retrieving the links, easier and more robust than scraping the page source. – Sven Marnach Aug 10 '14 at 10:50
  • Also, here's a link to [the API docs](https://github.com/500px/api-documentation)... y'know, so you'd learn how to use the client. It's incredibly detailed. I'm actually kinda in awe here - they have a freakin API console to do tests and stuff. – yuvi Aug 10 '14 at 10:52

1 Answer


One way to get at the source code of a web page is to use the Beautiful Soup library; a tutorial of this approach is shown here. The code from that page is shown below; the comments are mine. This particular code no longer works as-is because the contents of the example site have changed, but the concept should help you do what you want to do. Hope it helps.

from bs4 import BeautifulSoup
# If Python2:
#from urllib2 import urlopen
# If Python3 (urllib2 has been split into urllib.request and urllib.error):
from urllib.request import urlopen

BASE_URL = "http://www.chicagoreader.com"

def get_category_links(section_url):
    # Download the page's raw HTML into a variable called html. Note that this is
    # the source the server sends, which can differ from the live DOM that
    # "Inspect Element" shows after JavaScript has run.
    html = urlopen(section_url).read()
    # Parse the stuff.
    soup = BeautifulSoup(html, "lxml")    
    # The next two lines will change depending on what you're looking for. This 
    # line is looking for <dl class="boccat">.  
    boccat = soup.find("dl", "boccat")    
    # This line organizes what is found in the above line into a list of 
    # hrefs (i.e. links). 
    category_links = [BASE_URL + dd.a["href"] for dd in boccat.find_all("dd")]
    return category_links
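
A quick, hypothetical usage of the function above (the section path is only a placeholder; substitute a page on the site that actually contains a `<dl class="boccat">` element):

# Hypothetical usage -- "/some-section" is a placeholder path, not a real URL.
links = get_category_links(BASE_URL + "/some-section")
for link in links:
    print(link)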

EDIT 1: The solution above provides a general way to web-scrape, but I agree with the comments to the question. The API is definitely the way to go for this site. Thanks to yuvi for providing it. The API is available at https://github.com/500px/PxMagic.


EDIT 2: There is an example, written by the author of the API, that addresses your question about getting links to popular photos. The Python code from that example is pasted below. You will need to install the API library.

import fhp.api.five_hundred_px as f
import fhp.helpers.authentication as authentication
from pprint import pprint

# Read the consumer key and secret used to authenticate with the 500px API.
key = authentication.get_consumer_key()
secret = authentication.get_consumer_secret()

# Create an API client and request the "popular" photo feed.
client = f.FiveHundredPx(key, secret)
results = client.get_photos(feature='popular')

# Print the first PHOTOS_NEEDED photo records returned by the API.
i = 0
PHOTOS_NEEDED = 2
for photo in results:
    pprint(photo)
    i += 1
    if i == PHOTOS_NEEDED:
        break
  • When I run `html = urlopen("http://500px.com/popular").read()`, I can't find the links. It's not the same code as when I right-click in my browser and select "Inspect Element". – EspenG Aug 10 '14 at 11:04
  • @user3465589: The links are late-loaded by some JavaScript embedded in that page. It will be really difficult to get the links unless you simply use the API. Honestly. – Sven Marnach Aug 10 '14 at 11:26
  • @user3465589 I copied and pasted code from an example written by the author of the API. – gary Aug 10 '14 at 11:49
  • Okay! But how do I install the API library? – EspenG Aug 10 '14 at 11:56
  • @user3465589 you don't install an API, you [use it](http://en.wikipedia.org/wiki/Application_programming_interface) (in this case - you need to download and *use* the client) – yuvi Aug 10 '14 at 12:17
  • Right, some packages can be installed with pip (i.e. `pip install some-package-name`) but this API should just be downloaded and stored to a location where Python can see it. – gary Aug 10 '14 at 12:54
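
A minimal sketch of that "download it and put it where Python can see it" step, assuming the PxMagic repository has been cloned locally and that the fhp package sits at the top level of the clone (the path below is a placeholder):

# Hypothetical setup: make a locally cloned PxMagic repository importable.
# /path/to/PxMagic is a placeholder -- point it at wherever you cloned the repo,
# e.g. after: git clone https://github.com/500px/PxMagic.git
import sys
sys.path.append("/path/to/PxMagic")

import fhp.api.five_hundred_px as f  # should now import successfully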