Scraping URLs from a node.js data stream on the fly

Question

I am working with a node.js project (using Wikistream as a basis, so not totally my own code) which streams real-time Wikipedia edits. The code breaks each edit down into its component parts and stores it as an object (see the gist at https://gist.github.com/2770152). One of the parts is a URL. I am wondering if it is possible, when parsing each edit, to scrape the URL that shows the differences between the pre-edit and post-edit Wikipedia page, grab the difference (inside a span with the classes 'diffchange diffchange-inline', for example), and add that as another property of the object. Right now it could just be a string; it does not have to be fully structured.

I've tried using nodeio and have some code like this (I am specifically trying to scrape only edits that have been marked in the comments (m[6]) as possible vandalism):

    if (m[6].match(/vandal/) && namespace === "article") {
        nodeio.scrape(function () {
            this.getHtml(m[3], function (err, $) {
                //console.log('getting HTML, boss.');
                console.log(err);
                var output = [];
                $('span.diffchange.diffchange-inline').each(function (scraped) {
                    output.push(scraped.text);
                });
                vandalContent = output.toString();
            });
        });
    } else {
        vandalContent = "no content";
    }

When it hits the conditional statement it scrapes one time and then the program closes out. It does not store the desired content as a property of the object. If the condition is not met, it does store a vandalContent property set to "no content".

What I am wondering is: is it even possible to scrape like this on the fly? Is the scraping bogging the program down? Are there other suggested ways to get a similar result?

score 0 · Accepted Answer · answered May 22 '12 at 16:57

I haven't used nodeio yet, but the signature looks like an async callback. From a program-flow perspective, the scrape happens in the background and therefore does not block the next statement from running (the next statement being whatever comes after your if block), so vandalContent has not been set yet when that statement runs.
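To illustrate the ordering described above, here is a minimal sketch. It does not use nodeio at all: `fakeScrape` is a hypothetical stand-in that, like any async callback API, delivers its result on a later turn of the event loop, and the URL is a placeholder.

```javascript
// Hypothetical stand-in for an async scraper (NOT nodeio's API):
// it hands its result to the callback on a later turn of the event loop.
function fakeScrape(url, callback) {
    setTimeout(function () {
        callback(null, 'scraped text for ' + url);
    }, 10);
}

var order = [];
var vandalContent;

fakeScrape('http://example.org/diff', function (err, text) {
    vandalContent = text;            // assigned later, inside the callback
    order.push('callback ran');
});

// This statement runs first, while vandalContent is still undefined.
order.push('statement after the call, vandalContent = ' + vandalContent);
```

If the edit object is built or stored at the point of that last statement, it is assembled before the scraped text exists, which would match the behaviour described in the question.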

It looks like you're trying to do it sequentially, which means you need to either rethink what you want your callback to do or else force it to be sequential by putting the whole thing in a while loop that exits only when you have vandalContent (which I wouldn't recommend, since a busy loop blocks Node's single thread and the callback would never get a chance to run).

For a test, try doing a console.log on your vandalContent in the callback and see what it spits out.
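One way to rethink the callback is to finish building the edit object inside it, and only then pass the object along. The following is a rough sketch under assumptions, not the Wikistream or nodeio code: `scrapeDiff` is a hypothetical stand-in for the real scraper, `processEdit` and `done` are made-up names, and `edit.comment` and `edit.url` stand in for m[6] and m[3] from the question.

```javascript
// Hypothetical stand-in for the real scraper (NOT nodeio's API): calls back
// asynchronously with an array of changed text fragments found at the URL.
function scrapeDiff(url, callback) {
    setTimeout(function () {
        callback(null, ['foo', 'bar']);
    }, 10);
}

// Hands the edit to `done` only once vandalContent has been decided.
function processEdit(edit, done) {
    if (!/vandal/.test(edit.comment) || edit.namespace !== 'article') {
        edit.vandalContent = 'no content';
        // Defer the call so `done` is always invoked asynchronously.
        return process.nextTick(function () {
            done(null, edit);
        });
    }
    scrapeDiff(edit.url, function (err, fragments) {
        if (err) {
            return done(err);
        }
        edit.vandalContent = fragments.join(', ');
        done(null, edit);
    });
}

// Usage: whatever consumes the edit (storing it, emitting it to clients)
// goes inside this callback rather than after the call.
processEdit(
    { comment: 'rv vandalism', namespace: 'article', url: 'http://example.org/diff' },
    function (err, edit) {
        if (err) {
            return console.error(err);
        }
        console.log(edit.vandalContent);
    }
);
```

The design choice here is that nothing downstream ever sees an edit without its vandalContent property, at the cost of each flagged edit being delayed until its scrape completes.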

answered May 22 '12 at 16:57

Paul

Yeah, you're definitely right (and the console.log spits out nothing, which indicates other issues I might be having, here). What might be a better way to rethink getting to scraping the content from each url and somehow attaching that to each edit object? I am fairly new to node. – roy May 22 '12 at 19:24
I'm going to consider this about as answered as this question is going to get; I am probably best off rethinking how to approach this problem. Thanks. – roy May 26 '12 at 16:04
