0

First of all, apologies in advance for the long story and using the wrong terminology sometimes.

Hopefully someone can advice us in how to optimally implement BigQuery into our organization.

Current setup
At the moment, we have a data warehouse in MySQL, with normalized data using a snowflake schema.

Data is gathered from other SQL and MySQL databases using mainly a Simego - Ouvvi automation server.

We use Qlikview and PowerBI software to visualize this data.

This all works fine, however the last year we've been working towards a more event driven approach. We aim to create events for basically everything that happens in our organization, meaning that we will be creating a lot more data.

Furthermore, we'd like to combine internal data with different data sources like Google Analytics.

Future setup
We've started building a Rabbit MQ to BigQuery pipeline using apache beam, to stream our data into a BigQuery dataset (called 'datalake' for now).

So far we're doing okay hopefully.

Now, to be able to use these messages for reporting purposes, we need to transform these event messages. Our idea is to query our messages and store the properties we need into a dataset called 'datawarehouse'. We'd like to use a snowflake schema for this warehouse, with some nested fields.

But how to accomplish this? I've played around with the BigQuery web UI for some queries, and I guess it's doable to just use scheduled queries.

However, we'd prefer using one (or more) tool(s) that makes it possible to (ordered by importance):

  • Design our datawarehouse schema (we use MySQL workbench in our old datawarehouse. Far from perfect, but good enough).
  • Transform data more easily than just using queries.
  • Apply version control. Who changed what, when etc.
  • Review changes before applying them.
  • Apply continuous deployment.

I've spent a lot of time googling, but there are so many tools I feel a bit overwhelmed. So any help, suggestion or criticism is highly appreciated!

Oebie
  • 19
  • 5
  • IMO python with Airflow and an ELT approach is worth considering. You may transform the data using intermediate table while your sql query remains in python environment. That way github can control your code so your review/version control can be taken care. – Fact Jan 16 '19 at 02:03
  • Thanks a lot, ill have a look on monday. – Oebie Jan 18 '19 at 12:34
  • Thank you @Fact, this is very helpful. Will have to figure out how to use it, but your suggestion seems promising! – Oebie Jan 30 '19 at 11:32

0 Answers0