I'm a Python programmer specializing in web scraping. I'm asking this question because I found nothing relevant in my searches.

Which popular, well-documented Python frameworks are available for scraping sites that are built purely with JavaScript? I currently know Mechanize and Beautiful Soup, but neither of them executes JavaScript, so I'm looking for something different. I would prefer something as elegant and simple as Mechanize.

I've done a bit of research and so far I've heard about Selenium, Selenium 2 and Windmill.

Right now I'm trying to choose one of these three, and I do not know of any others.

Can anyone point out the features of these frameworks and what makes them different? I heard that Selenium uses a separate server to do all its tasks and that it seems to be feature-rich. Also, what is the core difference between Selenium and Selenium 2? Please correct me if I'm wrong, and if you know of any other frameworks, do mention their features and other details.

Thanks.

Chacha Chowdhury
  • Quick comment: can you give us an example of a site you want to scrape? If I were writing a pure JavaScript site, I'd make sure any act of getting data to populate it was written as a separate Ajax call. The best way of 'scraping' would then be to find that Ajax call and get the data that way, rather than executing the JavaScript and parsing the resulting structure, which sounds messy. Do the target sites provide any kind of API so that the render/scrape process is unnecessary? – Spacedman Jun 12 '11 at 11:39
  • I do not know about the legal issues related to scraping the site below, or whether they provide an API, but here it is as an example: https://baesystems.taleo.net/careersection/2/jobsearch.ftl?lang=en – Chacha Chowdhury Jun 12 '11 at 11:47
  • Oh yuck. Whatever web designer thought that was a good way to do things needs to be shot... – Spacedman Jun 12 '11 at 12:06
  • I know, lol. The first question a potential employer asked me was whether I could scrape this site. Obviously I couldn't, which is why I'm trying to learn a new framework that supports automated handling of JavaScript. – Chacha Chowdhury Jun 12 '11 at 12:31
  • Did you get an answer after all? I'm looking into learning how to scrape JavaScript sites too! – dmvianna Jan 24 '13 at 05:38

1 Answer

Before using tools like Selenium, which are designed for front-end testing rather than scraping, you should look at where the data on the site comes from. Find out which XHR requests are made, what parameters they take and what they return.

For example, the site you mentioned in your comment issues a POST request with many parameters from JavaScript and displays the result. You probably only need the result of this POST request to get your data.
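A rough sketch of that approach in Python, using only the standard library, might look like the following. Everything specific in it is an assumption for illustration: the URL, the parameter names (`keyword`, `page`) and the `jobs` key in the response are placeholders, not the real site's API. The actual values have to be read from the request the page sends, for instance in the browser's network inspector.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

# Hypothetical endpoint: replace it with the URL of the XHR request
# that the page itself sends.
SEARCH_URL = "https://example.com/careersection/ajax/jobsearch"


def build_search_request(url, params):
    """Build the POST request that the page's JavaScript would normally send."""
    body = urlencode(params).encode("utf-8")
    headers = {
        "Content-Type": "application/x-www-form-urlencoded",
        # Some servers look at this header to recognise Ajax calls.
        "X-Requested-With": "XMLHttpRequest",
    }
    return Request(url, data=body, headers=headers, method="POST")


def fetch(request, timeout=30):
    """Send the request and return the decoded response body."""
    with urlopen(request, timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset)


def parse_jobs(payload):
    """Parse a JSON payload; the 'jobs' key is an assumed response shape."""
    return json.loads(payload).get("jobs", [])


if __name__ == "__main__":
    # Parameter names are placeholders for whatever the real request uses.
    request = build_search_request(SEARCH_URL, {"keyword": "engineer", "page": 1})
    print(request.get_method(), request.full_url)
    print(request.data.decode("utf-8"))
    # Calling fetch(request) would perform the real network call.
```

If the real endpoint returns HTML or XML instead of JSON, only `parse_jobs` would need to change, for example by handing the response to Beautiful Soup.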

stefanw
  • Hi, thanks for pointing out an alternative view on this. At the moment I don't know much about dealing with Ajax calls and scraping data from dynamic sites. The site I mentioned shows content that is generated dynamically by JavaScript and is not present in the HTML. Could you please point me towards resources, tutorials, etc. for learning how to do that? Also, would I need knowledge of JavaScript, XML and Ajax? Thanks. – Chacha Chowdhury Jun 15 '11 at 08:50