
I am pretty good with Python, so pseudo-code will suffice when the details are trivial. Please get me started on the task: how do I go about crawling the net for the snail-mail addresses of churches in my state? Once I have a one-liner such as "123 Old West Road #3 Old Lyme City MD 01234", I can probably parse it into City, State, Street, number, and apt with enough trial and error. My problem is: if I use white pages online, how do I deal with all the HTML junk, HTML tables, ads, etc.? I do not think I need their phone number, but it will not hurt - I can always throw it out once parsed. Even if your solution is half-manual (such as save to PDF, then open Acrobat, save as text), I might still be happy with it. Thanks! Heck, I will even accept Perl snippets - I can translate them myself.
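A rough sketch of that parsing step, assuming one-line addresses shaped like the example (street number and name, optional #apt, city, two-letter state, five-digit ZIP); real listings will need the trial and error mentioned above:

    import re

    # Assumed shape: "123 Old West Road #3 Old Lyme City MD 01234"
    # (number, street, optional "#apt", city, two-letter state, five-digit ZIP).
    ADDRESS_RE = re.compile(
        r"^(?P<number>\d+)\s+"
        r"(?P<street>.+?)"
        r"(?:\s+#(?P<apt>\S+))?\s+"
        r"(?P<city>[A-Za-z .]+?)\s+"
        r"(?P<state>[A-Z]{2})\s+"
        r"(?P<zip>\d{5})$"
    )

    def parse_address(line):
        m = ADDRESS_RE.match(line.strip())
        return m.groupdict() if m else None

    print(parse_address("123 Old West Road #3 Old Lyme City MD 01234"))
    # {'number': '123', 'street': 'Old West Road', 'apt': '3',
    #  'city': 'Old Lyme City', 'state': 'MD', 'zip': '01234'}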

Hamish Grubijan
  • Thanks for the list of technologies. Now, which online directory provides the highest signal-to-noise ratio? Also, what if the results span multiple pages? How can I click through them all using your technology of choice? – Hamish Grubijan Dec 14 '09 at 23:00

5 Answers


Try lynx --dump <url> to download the web pages. All the troublesome HTML tags will be stripped from the output, and all the links from the page will appear together.
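If you want to drive that from Python rather than the shell, a minimal sketch (it assumes lynx is installed and on your PATH):

    import subprocess

    def dump_page(url):
        # Runs "lynx --dump <url>" and returns the plain-text rendering;
        # lynx strips the HTML tags and lists the page's links at the bottom.
        return subprocess.check_output(["lynx", "--dump", url]).decode("utf-8", "replace")

    print(dump_page("http://www.example.com/"))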

mob

You could use mechanize. It is a Python library that simulates a browser, so you can crawl through the white pages much as you would do manually.

To deal with the 'HTML junk', Python has a library for that too: BeautifulSoup. It is a lovely way to get the data you want out of HTML (of course it assumes you know a little bit about HTML, as you will still have to navigate the parse tree).
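A minimal BeautifulSoup sketch -- the td class below is made up for illustration, so inspect the real page and adjust the selectors:

    from bs4 import BeautifulSoup   # modern package name; older code imports BeautifulSoup 3 directly

    html = open("listings.html").read()        # a page saved earlier, or fetched with urllib/mechanize
    soup = BeautifulSoup(html, "html.parser")

    # Pull the text out of every (hypothetical) address cell, ignoring the ads and layout tables.
    for cell in soup.find_all("td", class_="address"):
        print(cell.get_text(strip=True))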

Update: As for your follow-up question on how to click through multiple pages: mechanize is a library to do just that. Take a closer look at its examples, especially the follow_link method. As I said, it simulates a browser, so 'clicking' can be realized quickly in Python.
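A rough mechanize sketch of that clicking-through (the directory URL and the 'Next' link text are placeholders; mechanize is a Python 2-era library, so check the maintained fork if you are on Python 3):

    import mechanize

    br = mechanize.Browser()
    br.set_handle_robots(False)      # many directories disallow robots; decide for yourself
    br.open("http://www.example.com/churches?state=MD")   # placeholder URL

    while True:
        html = br.response().read()
        # ... hand `html` to BeautifulSoup here ...
        try:
            # "Click" the pagination link, exactly as you would in a browser.
            br.follow_link(text_regex=r"Next")
        except mechanize.LinkNotFoundError:
            break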

Frank

What you're trying to do is called scraping, or web scraping.

If you do some searches on Python and scraping, you will find a list of tools that can help.

(I have never used Scrapy, but its site looks promising :)
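For what it's worth, a skeleton of what a Scrapy spider looks like (the start URL and CSS selectors are invented; Scrapy takes care of the crawling, pagination, and output):

    import scrapy

    class ChurchSpider(scrapy.Spider):
        name = "churches"
        start_urls = ["http://www.example.com/churches?state=MD"]   # placeholder directory URL

        def parse(self, response):
            # One item per listing row; the selectors depend entirely on the real page.
            for row in response.css("tr.listing"):
                yield {"address": row.css("td.address::text").get()}

            # Follow the "next page" link, if any, and parse it with this same method.
            next_page = response.css("a.next::attr(href)").get()
            if next_page:
                yield response.follow(next_page, callback=self.parse)

You would run it with something like scrapy runspider church_spider.py -o churches.csv.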

Seth

Beautiful Soup is a no-brainer. Here's a site you might start at: http://www.churchangel.com/. They have a huge list and the formatting is very regular -- translation: easy to set up BSoup to scrape.
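A hedged sketch of the fetch-and-parse loop; the per-state URL pattern below is a guess for illustration only, so check the site's real structure first:

    import urllib.request
    from bs4 import BeautifulSoup

    STATE_URL = "http://www.churchangel.com/{state}.htm"   # assumed URL scheme, not verified

    def fetch_state(state_code):
        with urllib.request.urlopen(STATE_URL.format(state=state_code.lower())) as resp:
            return BeautifulSoup(resp.read(), "html.parser")

    soup = fetch_state("md")
    # From here, find the repeating listing elements and pull out the address text,
    # as in the BeautifulSoup example above.
    print(soup.title.string if soup.title else "(no title)")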

Peter Rowell

Python scripts might not be the best tool for this job, if you're just looking for addresses of churches in a geographic area.

The US Census provides a data set of churches for use with geographic information systems. If finding all the X in a spatial area is a recurring problem, invest in learning a GIS. Then you can bring your Python skills to bear on many geographic tasks.

Matt Stephenson