I'm fairly new to JavaScript programming, and I'm working on a web scraping script built with CasperJS.

The script works, but it's fairly slow. I'm trying to think of a way to write a wrapper script/program that launches several instances of the script I created in parallel, but I'm not sure of the best way to do so. I have experimented with GNU Parallel, but I'd prefer something in JS, PHP, or Python, as I'm more familiar with those languages.

I am also aware that the CasperJS instances will share cookies and local storage, but that's not an issue in my use case. If anyone more experienced with this kind of architecture and framework could assist me, I'd appreciate it.

Thanks!

  • Define "slow". Can you describe the bottlenecks you think you're experiencing? Could you use [`xargs` to run a bunch of instances in parallel](https://stackoverflow.com/questions/28357997/running-programs-in-parallel-using-xargs)? – tadman Oct 12 '17 at 02:44
  • Did you spend an hour walking through the tutorial for GNU Parallel? gnu.org/software/parallel/parallel_tutorial.html – Ole Tange Oct 12 '17 at 20:15
  • @tadman the slowness I'm referencing is related to the site I'm scraping. To do all the actions I need, it takes about 3 minutes. Considering I'll need to run about 60 of these before restarting, you can see why I'd want to run simultaneous jobs :) – Vinícius Fabri Oct 12 '17 at 23:35
  • @OleTange I didn't go all in on it, but I did read some of the sections and did some testing in the terminal. I'd like to look for other alternatives for now (especially in Node), as I'm not that familiar with shell scripting. – Vinícius Fabri Oct 12 '17 at 23:36

1 Answer

I ended up using the child_process module from Node.js (https://nodejs.org/api/child_process.html). It was pretty much what I wanted, and it uses the same language as my CasperJS script.

Tutorial and example I used: https://era86.github.io/2012/10/11/quick-and-dirty-nodejs-exec-limit-queue.html
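
For anyone landing here, below is a minimal sketch of that kind of limit queue. It is not the code from the linked tutorial, just one way the idea could look. The names `runLimited` and `spawnJob`, the `scraper.js` file name, and the `--account` option are hypothetical; it also assumes `casperjs` is on the PATH and can be started directly with `execFile`.

```javascript
const { execFile } = require('child_process');

// Default runner (assumption): each job is { cmd, args } and is started
// as its own process. `done` is called once the process has exited.
function spawnJob(job, done) {
  execFile(job.cmd, job.args, (error, stdout) => done(error, stdout));
}

// Run every job, but never more than `limit` at the same time.
// `onAllDone` receives one { error, output } entry per job, in job order.
function runLimited(jobs, limit, onAllDone, runJob = spawnJob) {
  const results = new Array(jobs.length);
  let next = 0;      // index of the next job to start
  let running = 0;   // jobs currently in flight
  let finished = 0;  // jobs that have completed

  function launch() {
    // Fill every free slot until the limit is hit or the queue is empty.
    while (running < limit && next < jobs.length) {
      const index = next++;
      running++;
      runJob(jobs[index], (error, output) => {
        running--;
        finished++;
        results[index] = { error: error || null, output };
        if (finished === jobs.length) {
          onAllDone(results);
        } else {
          launch(); // a slot was freed, so start the next queued job
        }
      });
    }
  }

  if (jobs.length === 0) {
    onAllDone(results);
  } else {
    launch();
  }
}

// Hypothetical usage: 60 scraping runs, at most 5 at once.
// const jobs = [];
// for (let i = 1; i <= 60; i++) {
//   jobs.push({ cmd: 'casperjs', args: ['scraper.js', '--account=' + i] });
// }
// runLimited(jobs, 5, (results) => {
//   const failed = results.filter((r) => r.error).length;
//   console.log('done, failed runs: ' + failed);
// });
```

A failed run is recorded in its result entry rather than stopping the queue, so one bad scrape does not block the remaining ones. The limit is worth tuning by hand, since each CasperJS instance starts its own headless browser.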