Apache Kylin looks like a great tool that will fill in the needs of a lot data scientists. It's also a very complex system. We are developing an in-house solution with exactly the same goal in mind, multidimensional OLAP cube with low query latency. Among the many issues, the one I'm concerned of the most right now is about fault tolerance. With large volumes of incoming transactional data, the cube must be incrementally updated, and some of the cuboids are updated over long period of time such as those with time dimension value at the scale of year. Over such long period, some piece of the complex system is guaranteed to fail, and how does the system ensure all the raw transactional records are aggregated into the cuboids exactly once, no more no less? Even each of the pieces has its own fault tolerance mechanism, it doesn't mean they will play together automatically. For simplicity, we can assume all the input data are saved in HDFS by another process, and can be "played back" in any way you want to recover from any interruption, voluntary or forced. What are Kylin's fault tolerance considerations, or is it not really an issue?
2 Answers
There are data faults and system faults.
Data fault tolerance: Kylin partitions cube into segments and allows rebuild an individual segment without impacting the whole cube. For example, assume a new daily segment is built on daily basis and get merged into weekly segment on weekend; weekly segments merge into monthly segment and so on. When there is data error (or whatever change) within a week, you need to rebuild only one day's segment. Data changes further back will require rebuild a weekly or monthly segment.
The segment strategy is fully customizable so you can balance the data error tolerance and query performance. More segments means more tolerable to data changes but also more scans to execute for each query. Kylin provides RESTful API, an external scheduling system can invoke the API to trigger segment build and merge.
A cube is still online and can serve queries when some of its segments is under rebuild.
System fault tolerance: Kylin relies on Hadoop and HBase for most system redundancy and fault tolerance. In addition to that, every build step in Kylin is idempotent. Meaning you can safely retry a failed step without any side effect. This ensures the final correctness, no matter how many fails and retries the build process has been through.
(I'm also Apache Kylin co-creator and committer. :-)

- 284
- 1
- 6
Notes: I'm Apache Kylin co-creator and committer.
The Fault Tolerance point is really good one which we actually be asked from some cases, when they have extreme large datasets. To calculate again from begin will require huge computing resources, network traffic and time.
But from product perspective, the question is: which one is more important between precision result and resources? For transaction data, I believe the exactly number is more important, but for behavior data, it should be fine, for example, the distinct count value is approximate result in Kylin now. It depends what's kind of case you will leverage Kylin to serve business needs.
Will put this idea into our backlog and will update to Kylin dev mailing list if we have more clear clue for this later.
Thanks.

- 260
- 1
- 2
- 7