I am pretty good with Python, so pseudo-code will suffice when details are trivial. Please get me started on the task - how do I go about crawling the net for the snail-mail addresses of churches in my state? Once I have a one-liner such as "123 Old West Road #3 Old Lyme City MD 01234", I can probably parse it into City, State, Street, number, apt with enough trial and error. My problem is - if I use white pages online, then how do I deal with all the HTML junk, HTML tables, ads, etc.? I do not think I need their phone number, but it will not hurt - I can always throw it out once parsed. Even if your solution is half-manual (such as save to PDF, then open Acrobat, save as text) - I might be happy with it still. Thanks! Heck, I will even accept Perl snippets - I can translate them myself.
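For the parsing step mentioned above, a rough sketch (the field breakdown and the optional "#apt" handling are assumptions to refine by trial and error against real listings):

    import re

    # Rough sketch for one-liners like "123 Old West Road #3 Old Lyme City MD 01234".
    # Assumes the two-letter state and 5-digit ZIP come last and that an optional
    # "#apt" marker separates street from city; real data will need more tuning.
    LINE_RE = re.compile(r'^(?P<number>\d+)\s+(?P<rest>.+?)\s+(?P<state>[A-Z]{2})\s+(?P<zip>\d{5})$')

    def parse_address(line):
        m = LINE_RE.match(line.strip())
        if not m:
            return None
        fields = m.groupdict()
        rest = fields.pop('rest')
        if '#' in rest:
            street, _, tail = rest.partition('#')
            apt, _, city = tail.partition(' ')
            fields.update(street=street.strip(), apt=apt, city=city.strip())
        else:
            # Without an apartment marker there is no obvious street/city boundary.
            fields.update(street=rest, apt=None, city=None)
        return fields

    print(parse_address("123 Old West Road #3 Old Lyme City MD 01234"))
    # {'number': '123', 'state': 'MD', 'zip': '01234', 'street': 'Old West Road', 'apt': '3', 'city': 'Old Lyme City'}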

Thanks for the list of technologies. Now, which online directory provides the highest signal-to-noise ratio? Also, what if the results span multiple pages? How can I click through them all using your technology of choice? – Hamish Grubijan Dec 14 '09 at 23:00
5 Answers
Try lynx --dump <url> to download the web pages. All the troublesome HTML tags will be stripped from the output, and all the links from the page will appear together.
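For example, a minimal sketch that drives lynx from Python (assuming lynx is installed and on your PATH; the URL and the " MD " filter are placeholders):

    import subprocess

    # Dump a page to plain text via lynx and keep the lines that look like
    # addresses. The URL and the state filter below are placeholders.
    url = "http://www.example.com/churches"
    text = subprocess.check_output(["lynx", "--dump", url]).decode("utf-8", "replace")
    for line in text.splitlines():
        if " MD " in line:  # crude filter for lines mentioning the state
            print(line.strip())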

Huh? If you're scraping different web sites with arbitrary layouts, the HTML is more likely to get in your way. – mob Dec 14 '09 at 23:38
You could use mechanize. It's a Python library that simulates a browser, so you could crawl through the white pages (similarly to what you do manually).
To deal with the 'HTML junk', Python has a library for that too: BeautifulSoup. It is a lovely way to get the data you want out of HTML (of course it assumes you know a little bit about HTML, as you will still have to navigate the parse tree).
Update: As to your follow-up question on how to click through multiple pages: mechanize is a library to do just that. Take a closer look at their examples, especially the follow_link method. As I said, it simulates a browser, so 'clicking' can be realized quickly in Python.
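A sketch of how the two fit together (the directory URL, the td class, and the "Next" link text are all placeholder assumptions; inspect the real pages for the right selectors, and check the site's terms of use before crawling):

    from mechanize import Browser      # pip install mechanize
    from bs4 import BeautifulSoup      # current BeautifulSoup lives in the bs4 package

    br = Browser()
    br.set_handle_robots(False)        # only do this if the site's terms allow it
    br.open("http://www.example-directory.com/churches?state=MD")

    while True:
        soup = BeautifulSoup(br.response().read(), "html.parser")
        for cell in soup.find_all("td", class_="address"):   # hypothetical markup
            print(cell.get_text(" ", strip=True))
        try:
            br.follow_link(text="Next")                      # "click" through to the next page
        except Exception:                                    # LinkNotFoundError when done
            break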

What you're trying to do is called scraping, or web scraping.
If you do some searches on Python and scraping, you may find a list of tools that will help.
(I have never used Scrapy, but its site looks promising :)
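For reference, a minimal (untested) Scrapy spider might look like the sketch below; the start URL and the CSS selectors are assumptions about a hypothetical directory layout, not a real site's markup:

    import scrapy

    class ChurchSpider(scrapy.Spider):
        name = "churches"
        start_urls = ["http://www.example-directory.com/churches?state=MD"]

        def parse(self, response):
            # yield one item per address cell (hypothetical selector)
            for addr in response.css("td.address::text").getall():
                yield {"address": addr.strip()}
            # follow the "next page" link if the directory paginates its results
            next_page = response.css("a.next::attr(href)").get()
            if next_page:
                yield response.follow(next_page, callback=self.parse)

Saved as churches_spider.py, it could be run with scrapy runspider churches_spider.py -o addresses.json.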

Beautiful Soup is a no-brainer. Here's a site you might start at: http://www.churchangel.com/. They have a huge list and the formatting is very regular -- translation: easy to set up BSoup to scrape.
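A starting-point sketch (it only dumps the page's links so you can see how the site is organized; the real selectors have to come from inspecting churchangel.com's actual markup):

    from urllib.request import urlopen   # urllib2 on Python 2
    from bs4 import BeautifulSoup

    # Fetch the front page and list its links; use this to find the state and
    # city listing pages before writing more specific scraping rules.
    html = urlopen("http://www.churchangel.com/").read()
    soup = BeautifulSoup(html, "html.parser")
    for link in soup.find_all("a"):
        print(link.get_text(strip=True), link.get("href"))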

Python scripts might not be the best tool for this job, if you're just looking for addresses of churches in a geographic area.
The US Census provides a data set of churches for use with geographic information systems. If finding all the X in a spatial area is a recurring problem, invest in learning a GIS. Then you can bring your Python skills to bear on many geographic tasks.
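If you go that route, a minimal sketch of reading a TIGER/Line point-landmark shapefile with the GDAL/OGR Python bindings could look like this (the file name and the FULLNAME field are placeholders; check the TIGER documentation for the actual layer and attribute names):

    from osgeo import ogr   # GDAL/OGR Python bindings; fiona or geopandas also work

    # Open a downloaded point-landmark shapefile and print landmark names.
    ds = ogr.Open("tl_2009_pointlm_MD.shp")      # placeholder file name
    layer = ds.GetLayer(0)
    for feature in layer:
        print(feature.GetField("FULLNAME"),      # hypothetical attribute name
              feature.GetGeometryRef().ExportToWkt())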

Sure, the dataset is called TIGER/Line and is available at http://www.census.gov/geo/www/tiger/tgrshp2009/tgrshp2009.html To start using it, read up on GIS concepts and grab a free GIS like QuantumGIS – Matt Stephenson Dec 16 '09 at 02:24