Scripted Browser Scraper

Question

What can I use to achieve the following: script a browser, or otherwise make requests to a server, in order to log in and browse the site, e.g. find links and navigate to them?

For now, since I am into Node.js, I have been looking at node.io. It lets you scrape a site quite easily, but the problem is that when I try to POST (to log in), I get nothing back!

nodeio = require "node.io"

nodeio.scrape ->

    @post "http://localhost/auth/login", {
        username: "username"
        password: "password"
    }, ->

        console.log "=====After Login====="

But I just get

OK: Job complete

Even if the login fails, shouldn't the "After Login" console.log still run?

Then I was thinking: maybe it is better to implement this by scripting a browser instead, since that would more closely simulate a real request?

score 2 · Answer 1 · answered Jul 22 '12 at 06:08


Selenium or Watir let you script a browser. They drive an actual browser, so they will be slower than lower-level tools, but they do everything a browser does (e.g., run JavaScript).

answered Jul 22 '12 at 06:08

Sam King


I have tried the Zombie.js route, which doesn't seem to work on some sites (not controlled by me); they probably detect that a bot might be connecting and reject the connection. So I am going the Selenium route. It works, but it is a bit too slow for my liking, though I suppose I can just leave it running. I think the cause of the slowness is that every time I do a `get(url)`, it waits for the entire page, including any ads or scripts, to render before continuing? – jm2 Jul 23 '12 at 01:21
I know that Watir waits for the whole page to load (not scripts, though). I heard that Selenium didn't, but they might have changed that. – Sam King Jul 23 '12 at 12:43

score 2 · Answer 2 · answered Jul 22 '12 at 18:35


node.io seems like it's a good tool for the job, but I'd also recommend zombie.js. It seems to be geared mostly towards testing, but the docs look like it'll be great for scraping, too.

If you want to go the scripted browser route, ignore my answer. :)

answered Jul 22 '12 at 18:35

rdrey


It appears some sites block my connection, or it somehow doesn't work; maybe Zombie/node.io is meant for testing or accessing sites you control? Maybe I need to set the user agent, etc.? – jm2 Jul 23 '12 at 01:22
Yeah, the sites you're scraping might check your user-agent, or have API rate limits per client/IP. – rdrey Jul 23 '12 at 06:59
Hmm... how can I make Zombie/node.io behave more like a real browser? Will just sending correct HTTP headers, like the user agent, work? What are the usual headers sent by a real browser? – jm2 Jul 23 '12 at 09:17
You could always run a little Node server on localhost, do a GET with a browser, and print out all the headers, to directly copy your browser's behaviour :) – rdrey Jul 23 '12 at 13:03
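
A minimal sketch of the "little Node server" idea from the last comment, using only Node's built-in `http` module. The function name `createHeaderEchoServer`, the `--serve` flag, and port 3000 are illustrative assumptions, not anything from the thread. It logs the headers of each incoming request and echoes them back as JSON, so you can see what your browser actually sends.

```javascript
// Header-echo server sketch. Names, the --serve flag, and the port are
// assumptions made for illustration only.
const http = require("http");

function createHeaderEchoServer() {
  return http.createServer((req, res) => {
    // req.headers holds the parsed request headers with lower-cased names.
    const dump = JSON.stringify(req.headers, null, 2);
    console.log(`${req.method} ${req.url}\n${dump}`);

    // Echo the same headers back so they are also visible in the browser.
    res.writeHead(200, { "Content-Type": "application/json" });
    res.end(dump);
  });
}

// Start listening only when explicitly asked to, e.g. `node <file> --serve`.
if (process.argv.includes("--serve")) {
  createHeaderEchoServer().listen(3000, "127.0.0.1", () => {
    console.log("Open http://localhost:3000/ in a browser to see its headers");
  });
}
```

Visiting the page in a real browser should show headers such as `user-agent`, `accept`, and `accept-language`, which could then be copied into the scraper's requests. Matching headers alone may not be enough for sites that also rate-limit per client or IP, as noted above.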
