
I have written an API which retrieves data from a MongoDB database. I also have a front-end application which uses the data from the API (both applications are written in Node JS using the Koa framework, if that's relevant).

I need to aggregate a large set of numerical data over a given period, calculating things like averages and quintiles, and this could be all data grouped by month, by year, or by personID.

I've read some examples where people say that the API should be used as a wrapper for the database layer, presenting access only to the raw data - but it makes sense to me that the logic would live on the database (rather than asking the front-end application to churn over the data).

Is this a common problem, and from your own experience, is it better to have the API do the aggregation, or the front-end application?

Example documents

{
    "date": ISODate("2016-07-31T07:34:05+01:00Z"),
    "value": 5,
    "personID": 123
},
{
    "date": ISODate("2016-08-01T12:53:05+01:00Z"),
    "value": 3,
    "personID": 789
}
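
For concreteness, here's a rough sketch of the kind of query I have in mind, using the official Node mongodb driver (the database/collection names here are placeholders, not my real schema): the average value per month within a given year.

const { MongoClient } = require('mongodb');

// Average "value" per calendar month for one year. An index on "date"
// keeps the $match stage cheap.
async function monthlyAverages(year) {
  const client = await MongoClient.connect('mongodb://localhost:27017');
  try {
    return await client.db('mydb').collection('values').aggregate([
      { $match: { date: { $gte: new Date(Date.UTC(year, 0, 1)),
                          $lt:  new Date(Date.UTC(year + 1, 0, 1)) } } },
      { $group: { _id: { month: { $month: '$date' } },
                  avgValue: { $avg: '$value' },
                  count: { $sum: 1 } } },
      { $sort: { '_id.month': 1 } }
    ]).toArray();
  } finally {
    await client.close();
  }
}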
  • An API must NOT be a wrapper over your database. It should contain all of your use cases and business logic. If your data has to be aggregated to be consumed as such, aggregate it in the backend. Aggregations that are fast to compute and are only used to show a summary in a front-end view could be computed on the client computer. The goal of an API is to have a common gateway to your system; clients and sub-systems alike must go through the same business cases. This ensures consistency and sanity of your system. – Jazzwave06 Aug 01 '16 at 13:27
  • Thanks @sturcotte06 - users make requests to my front-end application, which then makes the request to the API, meaning I can't make use of the client computer for the aggregation. I've done this for security reasons (as I don't give all of my users a unique API key and only the application makes use of a key). Just to check I understand correctly - are you saying that it is perfectly acceptable for an API to aggregate data for a front-end application to use? Thanks again. – ash Aug 01 '16 at 13:33
  • Yes, but it should not be the job of your API. You should have an aggregation process which runs off-hours. It would process all the data for the day and push it to aggregate data sets. Your API then serves this aggregate dataset as resources. In other words, your API is responsible for serving business cases and applying business rules. It should not be responsible for batch processing, as this is a time-consuming process and does not scale well with rapidly growing datasets. It is, however, the responsibility of your back-end to process this data. – Jazzwave06 Aug 01 '16 at 13:35
  • Thank you @sturcotte06 - this is really helpful. My only concern is that my data updates frequently throughout the day and needs to be viewed in real-time (unless I have a refresh rate of say every 'n' minutes). I was thinking that I could make use of MongoDB's aggregation functions - applying appropriate index(es) of course - but I am reconsidering this now after reading this. Really helpful, thank you again. – ash Aug 01 '16 at 13:46
  • It really depends on the scope of your aggregation. The aggregation process could run every 5 minutes. The only problem is if the process takes longer than 5 minutes. Good luck! – Jazzwave06 Aug 01 '16 at 13:50
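
To make the suggestion in these comments concrete, a minimal sketch of such a pre-aggregation job might look like the following (the collection names and the 5-minute interval are illustrative, not anyone's real setup):

const { MongoClient } = require('mongodb');

// Recompute per-month averages and replace the "monthly_averages"
// collection with the results; the API can then serve that collection
// as a plain resource without doing any math at request time.
async function refreshMonthlyAverages(db) {
  await db.collection('values').aggregate([
    { $group: { _id: { year: { $year: '$date' }, month: { $month: '$date' } },
                avgValue: { $avg: '$value' } } },
    { $out: 'monthly_averages' } // $out overwrites the target collection
  ]).toArray(); // iterating the cursor is what triggers $out
}

async function main() {
  const client = await MongoClient.connect('mongodb://localhost:27017');
  const db = client.db('mydb');
  await refreshMonthlyAverages(db);
  // Re-run every 5 minutes, as discussed above; a cron job works just as well.
  setInterval(() => refreshMonthlyAverages(db).catch(console.error), 5 * 60 * 1000);
}

main().catch(console.error);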

2 Answers


There are two perspectives you can approach this from: security or performance.

From the security angle, any data that you put on the front-end is considered, for security purposes, to be "dirty". This means if you accept any input whatsoever, you have to throw out any assumptions that the input is even remotely valid. Especially with large data-sets, you would need to do some form of validation on each of the Create/Update operations. While at first glance putting things on the client-side looks like it takes load off the server, unless you want exploits everywhere the server still has to iterate over the data in some form, if only to validate it.

From the performance angle, moving large data sets to the client is going to happen either way, but the same volume doesn't need to come back. Keeping the operations on the server means your Update-style operations are much smaller, as they don't need to move the entire data-set over the wire (though they can). To take it a step further, you can guarantee that at the very least you'll have control over the performance of the operations, whereas if you offload this onto the client, you're going to have to support every client's machine to some degree, which is a nightmare.

tl;dr: Security & Performance dictate heavily in favor of server side operations, especially on large data-sets.
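
As an illustration only (the route path, query parameter, and monthlyAverages() helper are made up, not the asker's real API), a server-side route in Koa might look like this:

const Koa = require('koa');
const Router = require('@koa/router'); // koa-router in older setups
const app = new Koa();
const router = new Router();

// Validate the (dirty) input on the server, let the database do the
// aggregation, and send back only the small summary.
router.get('/averages/monthly', async (ctx) => {
  const year = Number.parseInt(ctx.query.year, 10);
  if (!Number.isInteger(year)) ctx.throw(400, 'year must be an integer');
  ctx.body = await monthlyAverages(year); // e.g. the pipeline sketched in the question
});

app.use(router.routes());
app.listen(3000);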

  • Thank you for this considered answer - my question is purely for Read operations, as data is imported to MongoDB elsewhere. To extend on my original question - an example query might be to view the average value by month in a given year. There are ~30K documents (rows of data) that need to be calculated at the moment (this year), and in my mind it seems insane to send those across the network to be grouped accordingly and calculated. That said, my API needs to be accessed by others, and I want to keep it as 'clean' as possible (rather than just tailored for my front-end app). – ash Aug 01 '16 at 14:27
  • In that case I recommend a REST API that does the actual averages or other similar mathematical operations in the rear with the gear, and presents the data in a JSON format (for example) that your specific front-end consumes. This allows other services to consume the service as well. Decide on your standard for how the service presents the data, and build your front-end around it. See the tail end of the performance section: The less your client does, the less you have to support directly. – Shayne Fitzgerald Aug 01 '16 at 17:56
  • Addendum: In reference to the comments on your original post, sturcotte06 is correct in saying that you should determine if your operations can be effectively served "on demand". A lot of that will depend on if your data updates frequently. The more frequently records are created, the more frequently you will need to process the data set as a whole. – Shayne Fitzgerald Aug 01 '16 at 17:59

I have never thought about an API as being about raw data. The API is whatever the application wants - and rarely is that an SQL proxy.

A frontend engineer building a webapp probably wants to plug discrete business entities into their frontend app, component by component. But they probably also want to be able to make batch calls, and for backend performance and complexity to be completely masked. A mobile developer probably wants the above AND to grab a special aggregated mobile view of the data.

In its simplest form, Aggregator would be a simple web page that invokes multiple services to achieve the functionality required by the application. Since each service (Service A, Service B, and Service C) is exposed using a lightweight REST mechanism, the web page can retrieve the data and process/display it accordingly. If some sort of processing is required, say applying business logic to the data received from individual services, then you may likely have a CDI bean that would transform the data so that it can be displayed by the web page.

http://blog.arungupta.me/microservice-design-patterns/

Which, to paraphrase, is saying: you can just access services via widgets in a webpage, but if you need to process data for the frontend, use a backend service.
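
A toy version of that aggregator pattern might look like the following (the service URLs are placeholders, and this assumes Node 18+, where fetch is built in):

const Koa = require('koa');
const app = new Koa();

// One backend endpoint fans out to the individual services and merges
// their responses into the single payload the front-end actually needs.
app.use(async (ctx) => {
  const [profile, stats] = await Promise.all([
    fetch('http://service-a.internal/profile/123').then(r => r.json()),
    fetch('http://service-b.internal/stats/123').then(r => r.json()),
  ]);
  // Any enrichment/transformation (business logic) happens here.
  ctx.body = { profile, stats };
});

app.listen(3000);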

I've googled for "frontend view aggregation" and found nothing of note, which suggests to me that if your data aggregation is going to stay as simple as a few widgets on a single client platform, by all means stay in the frontend while you can. But the second the app grows in complexity, you will be presented with problems that can only be solved by a backend solution, such as:

  • Code reuse for enrichment/transformation across client platforms/views
  • Rate limiting
  • Performance/caching (see the sketch after this list)
  • Security
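
For instance, the caching point can be as small as one piece of Koa middleware; the TTL and the URL-as-cache-key scheme below are arbitrary choices for illustration, not a recommended design:

const Koa = require('koa');
const app = new Koa();

const cache = new Map();
const TTL_MS = 5 * 60 * 1000; // serve aggregates up to five minutes stale

app.use(async (ctx, next) => {
  const hit = cache.get(ctx.url);
  if (hit && Date.now() - hit.at < TTL_MS) {
    ctx.body = hit.body; // cache hit: skip the expensive aggregation
    return;
  }
  await next(); // downstream middleware computes the aggregate
  cache.set(ctx.url, { body: ctx.body, at: Date.now() });
});

app.listen(3000);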

Considering the above, I think backend view aggregation is an important piece of tech that you can only rarely get away without at scale.

The good news is that there is a healthy ecosystem, e.g. Strongloop (js) or plain old Express (node), to name a few. Plus there is a good amount of literature from engineering heroes on how to implement more complicated versions (Spotify, )

EDIT: Correct, Kong does not support API aggregation at this moment.
