
I am whatever comes before 'novice' in programming. I have written macros in VBA for Excel, and used Visual Studio a bit when I was younger, but that's about it.

My problem: To produce the reports I need at work, I have to extract data that is stored behind user-friendly query forms on my company's intranet. I have automated every other part of the report except this. I would like to write a program to access this webpage and fill in query forms for me with preset values, and then return the data that is output. I had a discussion with a computer scientist friend of mine who said this was easy to do with Haskell (his language of choice). However I'm no veteran so I'd like to learn a language a bit nearer to my level... Python seems a good bet.

My question: is it possible to do this type of data extraction with Python? How difficult would it be, and what is a good resource to teach myself about it?

I've done some research and come up with Scrapy, but I can't tell whether it fills in forms. Also, if there are other languages more suited to this, I'd be glad to hear it.

Sputnik

3 Answers


I would start by reading some basic tutorials on HTTP. A form is basically just a visual way to collect data. The meat of the form is the request your browser makes with that form data.

So "filling in forms" is often not strictly necessary (it may be, though hopefully not, because it CAN get complicated). What is necessary is learning what request that form actually makes to the server and emulating it. A super easy way to do this is with the Chrome developer tools or a Firefox extension called Firebug. Each of these lets you see all network traffic, including form submissions.

For example, if you have a form where you have to submit a date and a report type, the actual web request may look like

?date=2012-09-12&type=overview

So basically you would just have to find a way to make an HTTP request to that URL with that data. This is a trivial task, and pretty much every language has a way to do it.
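A minimal sketch of this in Python: the base URL and the field names `date` and `type` below are invented for illustration (take the real ones from your browser's network tab). In Python 3, the old `urllib`/`urllib2` functionality lives in `urllib.parse` and `urllib.request`:

```python
from urllib.parse import urlencode
from urllib.request import urlopen

# Hypothetical base URL and form fields -- replace with whatever
# your browser's network tab shows the real intranet form sending.
base_url = "http://intranet.example.com/report"
params = {"date": "2012-09-12", "type": "overview"}

# urlencode turns the dict into the same query string the form would send
query = urlencode(params)
url = base_url + "?" + query
print(url)  # http://intranet.example.com/report?date=2012-09-12&type=overview

# Actually fetching the page would then be:
# html = urlopen(url).read().decode("utf-8")
```

The `urlopen` call is left commented out since the URL is fictitious; against your real intranet endpoint it would return the report page's HTML for further processing.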

It is very possible to do this with Python, and there is an abundance of tutorials out there. Python has URL-handling libraries built into the standard library that can help: http://docs.python.org/library/urllib.html

Every time I use urllib2 I usually end up at http://www.voidspace.org.uk/python/articles/urllib2.shtml

dm03514
  • This is a goldmine; thank you so much for explaining the basics! I know very little about HTTP but those browser extensions sound great. So does urllib. – Sputnik Sep 14 '12 at 13:50

The easiest way is just to use urllib2. Usually, the arguments from your form are transferred to the server in the URL, where you can see them as ?foo=bar&bla=blah. You can encode the arguments for your request with urllib.urlencode:

Python and urllib2: how to make a GET request with parameters.
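A short sketch of the idea (using the Python 3 names, where `urllib2` became `urllib.request`; the example URL and field names are invented): if the form submits via POST rather than GET, the same encoded string goes in the request body instead of the URL:

```python
from urllib.parse import urlencode
from urllib.request import Request

# Invented endpoint and fields -- substitute your form's real ones.
form_data = urlencode({"foo": "bar", "bla": "blah"})

# GET: the arguments ride along in the URL itself
get_req = Request("http://example.com/search?" + form_data)

# POST: the same encoded string is sent as the request body (as bytes)
post_req = Request("http://example.com/search", data=form_data.encode("ascii"))

print(get_req.get_method())   # GET
print(post_req.get_method())  # POST
# urllib.request.urlopen(post_req) would then actually submit the form
```

Which of the two your form uses is visible in the browser's network tools as the request method.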

For a newbie, you formulate your thoughts very clearly; congrats.

Boris Burkov

Combining loginform and Scrapy, you can automate filling in forms and crawling web pages. Here's a tutorial on it: http://blog.scrapinghub.com/2012/10/26/filling-login-forms-automatically/