To give you an idea of the data:
The DB has collections/tables with over a hundred million documents/records each, and each document contains more than 100 attributes/columns. The data size is expected to grow a hundredfold soon.

Operations on the data:
There are mainly the following types of operations on the data:

  1. Validating and then importing the data into the DB; this happens multiple times daily
  2. Aggregations on this imported data
  3. Searches/finds
  4. Updates
  5. Deletes

Tools/software used:

  1. MongoDB for the database: a PSS-architecture (Primary, Secondary, Secondary) replica set with indexes (most of the queries are index scans)
  2. Node.js with Koa.js

Problems:
However, the application is very slow when it comes to aggregations, finds, etc.

What I have implemented for performance so far:

  1. DB Indexing
  2. Caching
  3. Pre-aggregations (using MongoDB's aggregate to aggregate the data beforehand during import and store the results in separate collections, so aggregations are avoided at runtime; see the sketch after this list)
  4. Increased RAM and CPU cores on the DB server
  5. A separate server for the Node.js application and the front-end build
  6. PM2 to manage the Node.js application and to spawn cluster workers
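
For illustration, here is a minimal sketch of such a pre-aggregation step (item 3) using the official Node.js MongoDB driver. The connection string, database, collection, and field names (events, dailyTotals, accountId, day, amount) are hypothetical placeholders; the $merge stage requires MongoDB 4.2 or later.

    const { MongoClient } = require("mongodb");

    async function preAggregate() {
      const client = await MongoClient.connect("mongodb://localhost:27017");
      const db = client.db("mydb");

      // Group the freshly imported raw data and persist the result into a
      // separate collection, so runtime queries read the small pre-computed
      // collection instead of aggregating 100M+ documents on the fly.
      await db.collection("events").aggregate([
        { $group: {
            _id: { accountId: "$accountId", day: "$day" },
            total: { $sum: "$amount" },
            count: { $sum: 1 }
        } },
        // $merge upserts into the target collection (MongoDB >= 4.2)
        { $merge: { into: "dailyTotals", whenMatched: "replace", whenNotMatched: "insert" } }
      ]).toArray(); // draining the cursor makes the pipeline (and $merge) run

      await client.close();
    }

    preAggregate().catch(console.error);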

However, even after implementing all of the above, the application is still not performant enough. I suspect the reason is simply that the data is huge. I do not know how Big Data applications are engineered to deliver high performance. Please advise.

Also, is my choice of technology unsuitable, and would changing the technology/tools help? If so, what is recommended in such scenarios?

I'm requesting your advice to help me improve the performance of the application.

Temp O'rary

1 Answer

It is not easy to give a definitive answer because we do not really have many details. What I would do is set up detailed monitoring, covering at least the following:

Machine Level:

  • monitor the overall CPU load (for all cores) and RAM usage on your DB machine
  • monitor disk IO on the disks where the data is stored
  • this should show whether the machine specs are a bottleneck (a minimal probe is sketched below)
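
As a minimal sketch of such a machine-level probe from Node.js (built-ins only; the /proc/diskstats read assumes Linux, and the 5-second interval is arbitrary):

    const os = require("os");
    const fs = require("fs");

    setInterval(() => {
      console.log({
        loadAvg1m: os.loadavg()[0],                        // 1-minute load average
        cores: os.cpus().length,
        freeMemMB: Math.round(os.freemem() / 1024 / 1024),
        totalMemMB: Math.round(os.totalmem() / 1024 / 1024),
      });
      // Raw per-device IO counters; diff two successive samples to get rates.
      console.log(fs.readFileSync("/proc/diskstats", "utf8"));
    }, 5000);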

Database & DB Process Level (my first guess, that this is the critical part):

  • what is the overall size of your data at the moment? (I know it will increase drastically, but if it is already too slow now, this is interesting information, especially in relation to the current RAM size and number of CPU cores)
  • monitor memory usage and CPU load for your mongod process
  • has a look at the query plans (while doing aggregations) shown you what improvements can be made?
  • have a look at the caching strategy: what strategy are you using?
  • this should give more detailed results on where to make improvements at the DB level: is it just a hardware bottleneck, or is it an aggregation problem? (a sketch of how to pull these numbers follows below)
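
A minimal sketch of pulling data size, mongod memory usage, and an aggregation query plan from Node.js; the database, collection, and pipeline are hypothetical placeholders:

    const { MongoClient } = require("mongodb");

    async function inspectDb() {
      const client = await MongoClient.connect("mongodb://localhost:27017");
      const db = client.db("mydb");

      // Overall data and index size, scaled to MB
      console.log(await db.command({ dbStats: 1, scale: 1024 * 1024 }));

      // Memory section of serverStatus: resident/virtual size of mongod
      const status = await db.command({ serverStatus: 1 });
      console.log(status.mem);

      // Query plan of a representative aggregation: watch for COLLSCAN
      // stages and a high totalDocsExamined / nReturned ratio.
      const plan = await db.collection("events")
        .aggregate([{ $match: { day: "2024-01-01" } }])
        .explain("executionStats");
      console.log(JSON.stringify(plan, null, 2));

      await client.close();
    }

    inspectDb().catch(console.error);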

Node.js App Level:

  • how much RAM and CPU does the Node.js app consume?
  • if there are multiple instances of the Node.js app, track this for all instances
  • does the data import also happen through the Node.js app? Does the load on the app increase drastically while importing data?
  • if you see a high load on this app, there is a need to act here (increase the number of instances, or split it into separate apps, e.g. the import as a separate app); a per-process probe is sketched below
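
As a minimal sketch of such a per-process probe inside the Koa app (with PM2 clustering, each worker logs its own numbers; the 10-second interval is arbitrary):

    let lastCpu = process.cpuUsage();

    setInterval(() => {
      const mem = process.memoryUsage();
      const cpu = process.cpuUsage(lastCpu); // CPU time spent since the last sample
      lastCpu = process.cpuUsage();
      console.log({
        pid: process.pid,                    // distinguishes PM2 cluster workers
        rssMB: Math.round(mem.rss / 1024 / 1024),
        heapUsedMB: Math.round(mem.heapUsed / 1024 / 1024),
        cpuMs: Math.round((cpu.user + cpu.system) / 1000), // microseconds -> ms
      });
    }, 10000);
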
Sebastian Hildebrandt