
I need to build a customer-360 database, with the following requirements:

  • A wide table in which each customer is one row, with many columns (say > 1,000).
  • ~20 batch analytics jobs run daily. Each job queries and updates a small set of columns across all rows; this includes aggregating data for reporting and loading/saving data for machine-learning algorithms.
  • We update customers' info in a few columns, for <= 1 million rows per day, with the update workload spread out across working hours. The table has more than 200 million rows.

For these requirements, I think a mutable columnar DB would be a perfect fit: it can be queried and aggregated by column, which is optimal for analytics, and it can absorb a few million changes spread throughout the day. The closest project I have found is Apache Kudu, but its 300-column limit is a big turn-off, since we have more than 1,000 columns.
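
To make the access pattern concrete, here is a rough sketch of one such daily job. It touches only three of the 1,000+ columns; the column names, HDFS paths, and the use of Spark over Parquet files are illustrative assumptions on my part, not the mutable columnar store I am asking about:

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("columnar-access-pattern").getOrCreate()

    # Hypothetical wide customer table; with a columnar layout the job below
    # reads only the three referenced columns, not all 1,000+.
    customers = spark.read.parquet("hdfs:///warehouse/customer360/wide_table")

    daily_report = (customers
                    .select("region", "churn_score", "monthly_spend")
                    .groupBy("region")
                    .agg(F.avg("churn_score").alias("avg_churn"),
                         F.sum("monthly_spend").alias("total_spend")))

    daily_report.write.mode("overwrite") \
        .parquet("hdfs:///warehouse/customer360/reports/churn_by_region")

With a row-oriented store, the same job would end up reading every column of every row.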

We would also prefer an open-source project.

Any suggestions?

Tung Vs
  • Why not use any RDBMS system for this (mysql, postgres, mariadb)? Any particular reason for sticking to apache-kudu? – shanmuga Oct 02 '18 at 11:02
  • @shanmuga As far as I know, RDBMSes are limited in scaling out; we cannot just throw more servers into the cluster as we grow, the way we can with NoSQL. We also plan to grow the number of columns to more than 3,000. By the way, see my own answer below. – Tung Vs Oct 07 '18 at 10:14

1 Answer


I will answer my own question, since our solution works fine now.

Instead of having a unified DB for both analytics and OLTP workloads, we split the workload in two: the analytics workload is served by Parquet tables in HDFS, and the OLTP workload is served by HBase.
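
For the OLTP side, here is a minimal sketch of a point update, assuming the HBase Thrift gateway and the happybase Python client (the host, table, and column names are made up):

    import happybase

    # Hypothetical HBase Thrift gateway host and table name.
    connection = happybase.Connection("hbase-thrift-gateway")
    customers = connection.table("customer360")

    # Update a few columns of a single customer row; HBase absorbs these
    # small point writes throughout the working day.
    customers.put(b"customer#0012345", {
        b"profile:email": b"new.address@example.com",
        b"profile:segment": b"premium",
    })

    # Read the same row back, fetching only the columns we care about.
    row = customers.row(b"customer#0012345",
                        columns=[b"profile:email", b"profile:segment"])
    print(row)

The daily analytics jobs then read only the columns they need from the Parquet tables in HDFS, along the lines of the aggregation sketch in the question.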

Of course we have to duplicate (part of) the customer data, but the extra storage and computing cost is modest, and one we are willing to pay.
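
The duplication itself is just a periodic refresh job. Below is a simplified sketch, again under assumed table and column names, that scans customer rows out of HBase and rewrites the analytics snapshot as Parquet; a real job over 200+ million rows would use the hbase-spark connector or a bulk export rather than collecting rows on a single machine:

    import happybase
    from pyspark.sql import Row, SparkSession

    spark = SparkSession.builder.appName("customer360-snapshot-refresh").getOrCreate()

    # Hypothetical connection details; scan only the column families the
    # analytics jobs actually need.
    connection = happybase.Connection("hbase-thrift-gateway")
    customers = connection.table("customer360")

    rows = []
    for key, data in customers.scan(columns=[b"profile", b"activity"]):
        rows.append(Row(
            customer_id=key.decode(),
            segment=data.get(b"profile:segment", b"").decode(),
            monthly_spend=float(data.get(b"activity:monthly_spend", b"0").decode() or "0"),
        ))

    # Overwrite the Parquet snapshot that the daily analytics jobs read.
    spark.createDataFrame(rows).write.mode("overwrite") \
        .parquet("hdfs:///warehouse/customer360/snapshot")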
