
For the better part of the last year my company has been slicing up a monolith and building new products on the principles of (micro)service architecture. This is all fine and gives us great flexibility in keeping UI and backend logic separate and reducing the number of dependencies.

BUT!

There is an important part of our business that has a growing headache as a result of this, namely reporting.

Since we make sure that there is no data replication (or business logic sharing) between services, each service knows only its own data, and if another service really needs to keep a reference to that data, it does so through IDs (entity linking, essentially). While this is great otherwise, it's not great for reporting.

Our business often needs to create ad-hoc reports about specific events involving our customers. In the 'old days' you wrote a simple SQL query that joined a couple of database tables and queried whatever you needed, but that is not possible with decoupled services. And the business sees this as a problem.

I am personally not a fan of data replication for reporting purposes in the back end, as that tends to grow into a nightmare of its own (which it already is, even in our legacy monoliths). So this problem is not really about legacy monoliths versus modern microservices, but about data dependencies in general.

Have you faced issues like this and if yes, then how did you solve it?

EDIT:

We have been discussing a few potential solutions in-house, but none of them is actually good, and I have not yet found an answer that solves the issue at large scale.

  1. Good old replicate-everything-and-let-BI-people-figure-it-out is what is still used to this day. In the monolith days the BI/data-warehouse team made duplicates of all databases; the same practice is more inconvenient now, but it is still applied to every microservice that uses a database. This is bad for various reasons and comes with all the shared-sandbox problems you would expect.

  2. Build a separate microservice, or a set of microservices, dedicated to producing specific reports. Each of them connects to the set of microservices that carry the relevant data and builds the report as expected. This introduces tighter coupling, however, and can be incredibly complicated and slow with large datasets.

  3. Build a separate microservice, or a set of microservices, whose databases are replicated in the background from the other services' databases. This is problematic because team databases become coupled, data is replicated directly, and there is a strong dependency on the database technology being used.

  4. Have each service send out an event to RabbitMQ that BI services would pick up on, fetching additional data if needed. This sounds by far the best to me, but also by far the most complex to implement, as all services would need to start publishing the relevant data. It is what I would personally choose at present, at a very abstract level at least.
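To make option 4 concrete, here is a minimal sketch of the publishing side in Python. The event envelope, the `bi.events` exchange name, and the field names are all illustrative assumptions, not an established schema; `channel` is assumed to be an open channel from a RabbitMQ client such as pika:

```python
import json
from datetime import datetime, timezone

def build_event(event_type, payload):
    """Wrap domain data in a common envelope so BI consumers can
    parse every event the same way (envelope schema is an assumption)."""
    return {
        "type": event_type,
        "occurred_at": datetime.now(timezone.utc).isoformat(),
        "payload": payload,
    }

def publish(channel, event):
    """Publish to a fanout exchange that BI services bind their queues to.
    `channel` would be a pika channel connected to a live RabbitMQ broker."""
    channel.basic_publish(
        exchange="bi.events",   # hypothetical exchange name
        routing_key="",
        body=json.dumps(event),
    )

# Each service only announces facts about its own data:
event = build_event("customer.logged_in", {"customer_id": 42})
```

The point of the envelope is that each service keeps owning its data and merely announces what happened; the BI side decides which events to store and how to combine them.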

kingmaple
  • *keeping UI and backend logic separate* - this is not the reason you do SOA. – tom redfern Apr 03 '17 at 10:05
  • https://www.infoq.com/articles/BI-and-SOA - may help – tom redfern Apr 03 '17 at 10:15
  • Possible duplicate of [Reports in SOA (Business Intelligence & Service Oriented Architecture)](http://stackoverflow.com/questions/9538710/reports-in-soa-business-intelligence-service-oriented-architecture) – tom redfern Apr 03 '17 at 10:16
  • could you provide more info about your reporting needs? concrete examples please – FuzzyAmi Apr 04 '17 at 05:58
  • It is incredibly ad-hoc. For example, the business finds that it needs information about all the customers that have not logged in within a month. There are ways to do it (such as keeping this log-in time in the customers service), but that is only the tip of the iceberg. Suddenly there is a requirement to pull out customers that haven't logged in within a month but who were active users before that, meaning that data is required from multiple services. – kingmaple Apr 05 '17 at 08:30
  • You already have a couple answers that tell you how to do it with a BI approach (central repo with all the data gathered through events) and I'd say that's the correct way. However, re. this: `In the 'old days' you made a simple SQL query that joined a couple tables (...) but it is not possible with decoupled services` How come? It might not be as _simple_ as a single SQL query, but you certainly should be able to call whichever services are necessary, combine their data and build the report. – walen Apr 06 '17 at 06:54
  • Let every microservice write its own log file, add a transaction id to it and use a tool like Splunk. (Where you can have multiple indexers) – NickD Apr 07 '17 at 05:42

2 Answers


So, I'm not sure this will answer your needs, but I'll describe our overall approach to BI:

  1. Everything in our system generates an event: actions in the backend, actions in the mobile apps - everything we want to track produces an event with the relevant data (IDs, time, name, etc.).
  2. All the events are sent to a common funnel for collection - a backend app that receives events, makes sure they're valid, and stores them.
  3. You can store the events in some NoSQL storage (like Elasticsearch) or in the cloud (like Google's BigQuery).
  4. Once they're in, it's just a matter of querying and cross-referencing to get the overall picture you want. That's what our BI people do: they generate a picture from the heaps of events we collect.
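Steps 2 and 3 above can be sketched as a minimal funnel. This version validates and stores in memory purely for illustration; a real funnel would write to Elasticsearch or BigQuery, and the envelope fields are assumptions:

```python
import json

# Fields every event must carry (assumed envelope schema).
REQUIRED = {"type", "occurred_at", "payload"}

class EventFunnel:
    """Receives raw events, rejects invalid ones, stores the rest.
    Storage is an in-memory list here; swap in a real sink in practice."""

    def __init__(self):
        self.store = []

    def ingest(self, raw):
        event = json.loads(raw)
        if not REQUIRED.issubset(event):
            return False  # malformed event: missing required fields
        self.store.append(event)
        return True

funnel = EventFunnel()
ok = funnel.ingest(
    '{"type": "login", "occurred_at": "2017-04-06T11:00:00Z", "payload": {"user": 1}}'
)
bad = funnel.ingest('{"type": "login"}')  # rejected: no occurred_at/payload
```

Validating at the funnel keeps bad data out of the store, so the BI queries in step 4 can trust the envelope shape.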
FuzzyAmi
  • This is a good reply, and close to what I arrived at myself in terms of event-driven BI, similar to option #4 in my edited original post. But is there any way to do this iteratively? It is potentially a huge change. – kingmaple Apr 06 '17 at 11:46
  • Assuming you already have some solution, implementing the event approach should be transparent. Start by determining which events you want and what they should look like, and gradually add them to your apps. You don't even have to actually collect them into a database at the first stage. It's enough to set up a webserver that receives them (but does nothing more than send 200 OK). – FuzzyAmi Apr 06 '17 at 11:55

The solution is to aggregate data from different services into a central reporting database. This is feasible if the collected data is versioned by time - i.e. you can go to the reporting data and get point-in-time data that is correct (for that time).
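A minimal sketch of what time-versioned reporting data could look like, using SQLite for illustration; the table, columns, and plan values are assumptions. Each change to a customer produces a new row stamped with `valid_from`, and a point-in-time query picks the latest row at or before the requested moment:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE customer_snapshot (
        customer_id INTEGER,
        plan        TEXT,
        valid_from  TEXT  -- ISO timestamp from which this row is current
    )
""")
# Two versions of the same customer: upgraded from "free" to "pro" in March.
conn.executemany(
    "INSERT INTO customer_snapshot VALUES (?, ?, ?)",
    [
        (1, "free", "2017-01-01T00:00:00Z"),
        (1, "pro",  "2017-03-01T00:00:00Z"),
    ],
)

def plan_as_of(conn, customer_id, as_of):
    """Return the plan that was current at the given point in time."""
    row = conn.execute(
        """SELECT plan FROM customer_snapshot
           WHERE customer_id = ? AND valid_from <= ?
           ORDER BY valid_from DESC LIMIT 1""",
        (customer_id, as_of),
    ).fetchone()
    return row[0] if row else None

# plan_as_of(conn, 1, "2017-02-01T00:00:00Z")  → "free"
# plan_as_of(conn, 1, "2017-04-01T00:00:00Z")  → "pro"
```

Because rows are never updated in place, the reporting database can answer "what was true at time T" for any past T, which is exactly what ad-hoc historical reports need.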

Getting that data into the reporting service can happen via events published by the various services, periodic imports, "log" aggregation, or combinations of these.

I call this pattern aggregated reporting.

Note that in addition to that you still need to get data from individual services for things that need to be up-to-date, as an aggregation solution has inherent delay (reduced freshness).

Edit: Considering the edits and the comments you've made (ad-hoc queries), I'd say you need to treat this as a journey. That is, you want to get to option 4, so start by pulling data from the sources you have to answer your current ad-hoc needs, then convert to messages as you move forward with development and add more sources.

Also, you may want to think about the difference between services (which don't share internal data structures between them and have strict boundaries) and aspects (semi-independent parts of a service that can use the same data source).

PS: I also wrote the InfoQ piece on BI & SOA that Tom mentioned in the comments, which essentially talks about this idea. The article is from 2007, i.e. I've successfully applied this for more than a decade now (different technologies, moving from schema-on-write to schema-on-read, etc., but the same principles).

Arnon Rotem-Gal-Oz