I would like to replicate the functionality that Facebook uses to parse a link. When you submit a link into your Facebook status, their system goes out and retrieves a suggested title, summary and often one or more relevant images from that page, from which you can choose a thumbnail.

My application needs to accomplish this using Python, but I am open to any guide, blog post, or account of other developers' experience that relates to this and might help me figure out how to accomplish it.

I would really like to learn from other people's experience before just jumping in.

To be clear, when given the URL of a web page, I want to be able to retrieve:

  1. The title: Probably just the <title> tag but possibly the <h1>, not sure.
  2. A one-paragraph summary of the page.
  3. A bunch of relevant images that could be used as a thumbnail. (The tricky part is filtering out irrelevant images such as banners or rounded corners.)

I may have to implement it myself, but I would at least want to know about how other people have been doing these kinds of tasks.

Troy Alford
Ram Rachum

2 Answers

BeautifulSoup is well-suited to accomplish most of this.

Basically, you simply initialize the soup object, then do something like the following to extract what you are interested in:

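title = soup.findAll('title')   # list of <title> tags; soup.title.string gives the text itself
images = soup.findAll('img')    # every <img> tag; each tag's URL is in its 'src' attribute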

You could then download each of the images based on its URL using urllib2 (urllib.request in Python 3).
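If you would rather avoid a dependency, here is a rough sketch of the same title and image extraction using only the standard library's html.parser instead of BeautifulSoup. The names TitleAndImages and parse_page are made up for this example, and the sketch assumes you have already fetched the page's HTML as a string. It also resolves relative src values against the page URL, since you need absolute URLs before you can download anything.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class TitleAndImages(HTMLParser):
    """Collect the <title> text and every <img> src from an HTML document."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.title_parts = []
        self.image_urls = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "img":
            src = dict(attrs).get("src")
            if src:
                # Resolve relative paths against the URL of the page itself.
                self.image_urls.append(urljoin(self.base_url, src))

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title_parts.append(data)

    @property
    def title(self):
        return "".join(self.title_parts).strip()


def parse_page(html, base_url):
    """Return (title, list of absolute image URLs) for an HTML string."""
    parser = TitleAndImages(base_url)
    parser.feed(html)
    parser.close()
    return parser.title, parser.image_urls
```

Each URL in the returned list can then be fetched with urllib.request.urlopen to inspect the actual image.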

The title is fairly simple, but the images could be a bit more difficult since you have to download each one to get the relevant stats on them. Perhaps you could filter out most of the images based on size and number of colors? Rounded corners, as an example, are going to be small and only have 1-2 colors, generally.
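To make the filtering idea concrete, here is a minimal sketch of a size and color heuristic. It assumes you have already downloaded each image and measured its width, height and (optionally) number of distinct colors with an imaging library of your choice. ImageStats, looks_like_thumbnail, rank_candidates and all of the thresholds are hypothetical and would need tuning; the aspect-ratio check for banners is my own addition on top of the size and color checks described above.

```python
from typing import NamedTuple, Optional


class ImageStats(NamedTuple):
    url: str
    width: int
    height: int
    color_count: Optional[int]  # None when the number of colors is unknown


# Hypothetical thresholds; tune them against real pages.
MIN_SIDE = 50      # smaller than this is probably an icon or a rounded corner
MIN_COLORS = 3     # 1-2 colors suggests a spacer or a decorative corner
MAX_ASPECT = 3.0   # very wide or very tall images are probably banners


def looks_like_thumbnail(stats):
    """Return True if the image passes the size, shape and color checks."""
    if stats.width < MIN_SIDE or stats.height < MIN_SIDE:
        return False
    aspect = max(stats.width, stats.height) / min(stats.width, stats.height)
    if aspect > MAX_ASPECT:
        return False
    if stats.color_count is not None and stats.color_count < MIN_COLORS:
        return False
    return True


def rank_candidates(all_stats):
    """Keep plausible thumbnails, largest area first."""
    keep = [s for s in all_stats if looks_like_thumbnail(s)]
    return sorted(keep, key=lambda s: s.width * s.height, reverse=True)
```

Sorting by area is just one way to pick a default; you could equally present every surviving candidate and let the user choose.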

As for the page summary, that may be a bit more difficult, but I've been doing something like this:

  1. I use BeautifulSoup to remove all style, script, form, and head blocks from the HTML by calling .findAll and then .extract on each match.
  2. I grab the remaining text using: ''.join(soup.findAll(text=True))

In your application, perhaps you could use this "text" content as the page summary?
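As a rough illustration of those two steps, here is a sketch that uses the standard library's html.parser as a stand-in for BeautifulSoup. VisibleText, summarize, the list of block-level tags and the 200-character limit are all assumptions made for the example. It skips everything inside style, script, form and head, joins what is left, and cuts the result at a word boundary so it can serve as a short summary.

```python
from html.parser import HTMLParser

SKIPPED_TAGS = {"style", "script", "form", "head"}
# Tags after which a space is inserted so adjacent paragraphs do not run together.
BLOCK_TAGS = {"p", "div", "li", "br", "h1", "h2", "h3", "tr"}


class VisibleText(HTMLParser):
    """Collect text that lies outside style, script, form and head blocks."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0  # > 0 while inside any skipped block

    def handle_starttag(self, tag, attrs):
        if tag in SKIPPED_TAGS:
            self._skip_depth += 1
        elif tag in BLOCK_TAGS and self._skip_depth == 0:
            self.chunks.append(" ")

    def handle_endtag(self, tag):
        if tag in SKIPPED_TAGS:
            if self._skip_depth > 0:
                self._skip_depth -= 1
        elif tag in BLOCK_TAGS and self._skip_depth == 0:
            self.chunks.append(" ")

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.chunks.append(data)


def summarize(html, max_chars=200):
    """Return the page's visible text, truncated at a word boundary."""
    parser = VisibleText()
    parser.feed(html)
    parser.close()
    # Collapse runs of whitespace into single spaces.
    text = " ".join("".join(parser.chunks).split())
    if len(text) <= max_chars:
        return text
    cut = text.rfind(" ", 0, max_chars)
    return text[: cut if cut > 0 else max_chars].rstrip()
```

On real pages the first few hundred characters are often navigation text, so you may want to start from the first long paragraph rather than from the top of the body.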

I hope this helps.

Troy Alford
Donald Miner
  • BeautifulSoup is not well supported on Python 3.1, and its original author doesn't do much development on it anymore. You'd probably be better off using lxml.html and/or html5lib (the latter is recommended by the BeautifulSoup author). – Jul 21 '10 at 12:09
  • Good to know for future reference. Thanks! – Donald Miner Jul 21 '10 at 12:25

Here's a complete solution: https://github.com/svven/summary

>>> import summary
>>> s = summary.Summary('http://stackoverflow.com/users/76701/ram-rachum')
>>> s.extract()
>>> s.title
u'User Ram Rachum - Stack Overflow'
>>> s.description
u'Israeli Python hacker.'
>>> s.image
https://www.gravatar.com/avatar/d24c45635a5171615a7cdb936f36daad?s=128&d=identicon&r=PG
>>>
ducu