
I'd like to build a web app to help other students at my university create their schedules. To do that I need to crawl the master schedule (one huge HTML page) and the linked detailed description for each course into a database, preferably in Python. I also need to log in before I can access the data.

  • How would that work?
  • What tools/libraries can/should I use?
  • Are there good tutorials on that?
  • How do I best deal with binary data (e.g. pretty PDFs)?
  • Are there already good solutions for that?
McEnroe

4 Answers


If you want to use a powerful scraping framework there's Scrapy. It has some good documentation too. It may be a little overkill depending on your task though.
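A minimal spider sketch, assuming a hypothetical schedule URL and row markup (the selectors below are placeholders, not taken from the question):

import scrapy

class ScheduleSpider(scrapy.Spider):
    name = "schedule"
    # Hypothetical URL of the master schedule page
    start_urls = ["https://example.edu/master-schedule.html"]

    def parse(self, response):
        # Assumed markup: one <tr class="course"> per course
        for row in response.css("tr.course"):
            yield {
                "code": row.css("td.code::text").get(),
                "title": row.css("td.title::text").get(),
                "detail_url": response.urljoin(row.css("a::attr(href)").get()),
            }

Running it with "scrapy runspider schedule_spider.py -o courses.json" dumps the scraped items straight into a JSON file.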

Acorn
  • Would you recommend the same for this: http://stackoverflow.com/questions/23917790/how-to-web-crawl-some-sites – Si8 May 28 '14 at 17:46

Scrapy is probably the best Python library for crawling. It can maintain state for authenticated sessions.
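A sketch of how the authenticated part could look, assuming a hypothetical login form (the URLs and field names are placeholders):

import scrapy

class AuthScheduleSpider(scrapy.Spider):
    name = "auth_schedule"

    def start_requests(self):
        # Submit the (hypothetical) login form first; Scrapy's cookie
        # middleware keeps the session for all later requests.
        yield scrapy.FormRequest(
            "https://example.edu/login",
            formdata={"username": "me", "password": "secret"},
            callback=self.after_login,
        )

    def after_login(self, response):
        # Now that we are logged in, fetch the protected schedule page.
        yield scrapy.Request(
            "https://example.edu/master-schedule.html",
            callback=self.parse_schedule,
        )

    def parse_schedule(self, response):
        # Assumed markup: links pointing to the per-course detail pages.
        for href in response.css("a.course::attr(href)").getall():
            yield {"detail_url": response.urljoin(href)}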

Dealing with binary data should be handled separately. For each file type, you'll have to handle it differently according to your own logic. For almost any kind of format, you'll probably be able to find a library. For instance take a look at PyPDF for handling PDFs. For excel files you can try xlrd.
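A short sketch of handling the downloaded files, using pypdf (the maintained successor of PyPDF/PyPDF2) and xlrd; the file names are placeholders:

from pypdf import PdfReader
import xlrd

# Extract the text of every page of a downloaded PDF
reader = PdfReader("course_description.pdf")
pdf_text = "\n".join(page.extract_text() or "" for page in reader.pages)

# Read the first sheet of a legacy .xls workbook
book = xlrd.open_workbook("schedule.xls")
sheet = book.sheet_by_index(0)
first_cell = sheet.cell_value(0, 0)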

sharjeel

I liked using BeautifulSoup for extracting HTML data.

It's as easy as this:

from bs4 import BeautifulSoup   # BeautifulSoup 4 on Python 3
from urllib.request import urlopen

# Fetch the feed and parse it
ur = urlopen("http://pragprog.com/podcasts/feed.rss")
soup = BeautifulSoup(ur.read(), "html.parser")
items = soup.find_all('item')

# Collect the enclosure URL of every item
urls = [item.enclosure['url'] for item in items]
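
The same idea carries over to an HTML schedule page. A minimal sketch, assuming a hypothetical URL and markup (neither comes from the question), with a requests session to keep any login cookies:

import requests
from bs4 import BeautifulSoup

# A session keeps cookies, so log in once and reuse it for every page
session = requests.Session()
page = session.get("https://example.edu/master-schedule.html")
soup = BeautifulSoup(page.text, "html.parser")

# Assumed markup: every <a> whose href contains "course" points to a
# detailed course description
course_links = [a["href"] for a in soup.find_all("a", href=True)
                if "course" in a["href"]]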
Alexey Grigorev
  • I am using this too. I need to crawl about 1000 links on the same site ... but it takes too long... would you suggest me some better approach? I can show the code too –  Nov 07 '14 at 15:04

For this purpose there is a very useful tool called Web-Harvest (http://web-harvest.sourceforge.net/). I use it to crawl web pages.

Riz