I'm not a (very good) back-end developer, so processes and memory are a little above my pay grade.

I'm currently building an app using the MEAN stack. I have a separate Express server running on localhost that is a web scraper.

The flow I have is: my Angular app gathers the user's data -> sends it to the MEAN Express backend -> the Express route sends a POST request to my web scraper server, and then the scraper does its thing (uses requestjs to fetch each page, loads the HTML into cheerio and does some parsing).
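
The hand-off from my Express backend to the scraper is essentially this (a simplified sketch; the route paths, the scraper's port and the payload shape are placeholders rather than my exact code):

    // Simplified sketch of the MEAN backend route that forwards the user's
    // data to the scraper server. Paths, port and payload shape are placeholders.
    const express = require('express');
    const request = require('request'); // same request library the scraper uses

    const router = express.Router();
    router.use(express.json()); // parse the JSON body Angular sends

    // Angular POSTs the gathered data here; we forward it to the scraper server.
    router.post('/api/scrape', (req, res) => {
      request.post(
        { url: 'http://localhost:4000/scrape', json: req.body },
        (err, response, body) => {
          if (err) return res.status(502).json({ error: 'Scraper unreachable' });
          res.json(body); // e.g. an acknowledgement that the scrape has started
        }
      );
    });

    module.exports = router;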

The scraping process can take a little while (up to 5 minutes) so I want to send update messages to the browser. Currently I do this:

  1. Every 5 seconds, the browser sends a GET request to my MEAN API asking it to request an update message from my web scraper
  2. The MEAN API sends a GET request to my web scraper server
  3. The web scraper server checks the progress (just a local variable that gets updated inside the scraping function) and responds with it (rough sketch below).
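
Roughly, the progress check on the scraper side looks something like this (a simplified sketch; the endpoint name and the shape of the progress variable are placeholders, not my exact code):

    // Simplified sketch of the scraper server's progress endpoint.
    // Endpoint name and progress shape are placeholders.
    const express = require('express');
    const app = express();

    // Updated by the scraping function as each site is finished.
    let progress = { sitesDone: 0, sitesTotal: 0 };

    app.get('/progress', (req, res) => {
      res.json(progress);
    });

    app.listen(4000);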

This works, but while the scraper is running the update responses are VERY slow. See the log below:

[Screenshot: Node console log of the scraper server's responses]

It seems my web scraper server is struggling under the load from just one user's request (scraping about 1,500 websites). I can only imagine that when 10, 20 or 1,000 users are using the service, the whole thing will just crumble.

Is my flow completely wrong here? I feel like I'm a bit in over my head, but I'd like to learn how to debug where my web scraper is lagging and see what I can do to optimise it!

EDIT: As per the title - is this an issue where I'm not allocating enough memory to my Node/Express server or something?

1 Answer

This is an older post, but I figured I could help someone out with an answer.

First, the screenshot only shows the scraper server's responses, which does not help much from my perspective in trying to answer; however, I have been there before.

I am taking it that your Node/Express scraper and your Node/Express/Angular app are on the same hardware or a shared server (shared hardware, not just the same server instance).

If that is the case, even a 32-bit Python scraper could bog down a medium-sized setup if it loops through requests and responses as fast as the wire will allow.

You will likely want to:

  1. Log your scraper's success and fail rates per base URL, to make sure you are not getting blacklisted.
  2. Wait on each loop of the scraper, even if only for milliseconds (see the sketch below).
  3. And this is the most important: please get your scraper off your Angular hardware.
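
For #2, something along these lines works (a rough sketch; the request/cheerio usage, the delay value and the result shape are just illustrations to adapt to your own code):

    // Rough sketch of points #1 and #2: record per-URL success/failure and
    // pause a little on every iteration. The delay and parsing are placeholders.
    const request = require('request');
    const cheerio = require('cheerio');

    const delay = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

    async function scrapeAll(urls) {
      const results = [];
      for (const url of urls) {
        try {
          const html = await new Promise((resolve, reject) =>
            request(url, (err, res, body) => (err ? reject(err) : resolve(body)))
          );
          const $ = cheerio.load(html);
          results.push({ url, ok: true, title: $('title').text() }); // #1: log success
        } catch (err) {
          results.push({ url, ok: false, error: err.message });      // #1: log failure
        }
        await delay(250); // #2: small wait per loop so the box is not hammered flat out
      }
      return results;
    }

Even a small pause like that keeps the scraper from saturating the connection that everything else on the machine is sharing.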

We set up 10 virtual desktops and 10 physical desktops all running scrapers - and that is all they do. The problem is that the HTTP requests and responses and the scraping of the sites (especially if done asynchronously) will tax the hardware and the connection on that machine. I usually have the web app as the only thing running on the server it sits on, the database(s) on different servers, the CDN separate, and document/image storage separate again. I know it sounds a little complicated or daunting at first, but separation of concerns makes it way easier to debug bottlenecks.

Hope this helps moving forward.