
I have a product for defining and configuring business workflows. Part of this product is a form builder which enables users to set up different forms.

All of this form data is backed by MongoDB in the following structure:

- form_schemas

{
  "_id" : "",
  "name" : "",
  "account_id" : "",
  "fields" : [
    {
      "label" : "Enter Email",
      "name" : "email",
      "type" : "text",
      "required" : "true",
      "hidden" : "false",
      "additional_config" : { }
    },
    {
      "label" : "Select DOB",
      "name" : "dob",
      "type" : "date",
      "required" : "true",
      "hidden" : "false",
      "additional_config" : { }
    }
    ...
  ]
}

- form_datas

{
  "workflow_template_id" : "",
  "account_id" : "",
  "data" : {
    "email" : "xyx@gmail.com",
    "dob" : "2001-04-05"
  },
  "created_at" : "",
  "updated_at" : ""
}


As seen above, the forms can belong to various different businesses. I am looking at a data pipeline to transport the data to Google BigQuery at periodic intervals for analysis.

On the BQ side, I am maintaining a separate table for each workflow.
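
For context, each workflow's BQ table schema is derived from the corresponding form schema. A rough sketch of that mapping in Python (the type mapping here is simplified and only illustrative):

# Sketch only: deriving a BigQuery schema from a form_schemas document.
# FORM_TO_BQ_TYPE is my own simplification; real forms have more field types
# (grids, lookups, ...) that need their own handling.
FORM_TO_BQ_TYPE = {"text": "STRING", "date": "DATE", "number": "NUMERIC"}

def bq_schema_for(form_schema):
    return [
        {
            "name": field["name"],
            "type": FORM_TO_BQ_TYPE.get(field["type"], "STRING"),
            "mode": "REQUIRED" if field["required"] == "true" else "NULLABLE",
        }
        for field in form_schema["fields"]
        if field["hidden"] != "true"
    ]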

My current working solution is written entirely on Google Cloud Functions. A Cloud Scheduler job runs at periodic intervals and invokes the different Cloud Functions. At a high level, each Cloud Function does the following (a rough sketch follows the list):

  • Iterate over each schema
  • Read the data from MongoDB for every schema since the last run (as a cursor)
  • For each row of data, run the custom transformation logic (this includes transforming various nested data types like grids/lookups etc.)
  • Stream each row of transformed data as NDJSON to Google Cloud Storage
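
A rough sketch of that per-run logic, assuming pymongo and google-cloud-storage; the database/bucket names, the last_run checkpoint handling, and transform_row() are placeholders for my actual transformation code:

# Sketch only: one export run, reading rows changed since the last run and writing NDJSON to GCS.
import json
from datetime import datetime, timezone

from pymongo import MongoClient
from google.cloud import storage

def transform_row(schema, row):
    # Placeholder for the custom transformation of nested types (grids, lookups, ...)
    return {**row["data"], "workflow_template_id": str(row["workflow_template_id"])}

def export_since_last_run(mongo_uri, bucket_name, last_run):
    db = MongoClient(mongo_uri)["forms"]                  # placeholder database name
    bucket = storage.Client().bucket(bucket_name)
    run_ts = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%S")

    for schema in db["form_schemas"].find({}):
        cursor = db["form_datas"].find({
            "workflow_template_id": schema["_id"],
            "updated_at": {"$gt": last_run},              # only rows changed since the last run
        })
        lines = [json.dumps(transform_row(schema, row)) for row in cursor]
        if lines:
            blob = bucket.blob("exports/%s/%s.ndjson" % (schema["_id"], run_ts))
            blob.upload_from_string("\n".join(lines), content_type="application/x-ndjson")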

The above solution provides me with:

  • Complete control on transformation
  • Simple deployment

However, since it all runs on Cloud Functions, I am bound by the limitation of 9 minutes per invocation. This imposes heavy pagination requirements, especially if there is a need to regenerate the complete data from the beginning of time.

While the above solution works fine for now, I am looking at other serverless options like Google Dataflow. Since I am just starting with Dataflow/Apache Beam, I was wondering:

If I were to write a pipeline on Beam, should I go with the same approach of

  1. Extract (row by row) -> Transform -> Load (GCS) -> Load (BQ)

or

  2. Extract (entire data as JSON) -> Load to GCS -> Transform (Beam) -> Load to GCS -> Load to BQ

Let me know if there is a better option for this data processing.
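
For reference, this is roughly the shape of option 1 that I have in mind on the Beam Python SDK; a sketch only, with placeholder connection strings, bucket/table names, and transform logic:

# Sketch of option 1: row-by-row extract -> transform -> NDJSON on GCS -> BigQuery.
import json

import apache_beam as beam
from apache_beam.io import mongodbio
from apache_beam.options.pipeline_options import PipelineOptions

def transform_row(doc):
    # Placeholder for the custom transformation of nested types (grids, lookups, ...)
    return {"email": doc["data"].get("email"), "dob": doc["data"].get("dob")}

with beam.Pipeline(options=PipelineOptions()) as p:
    rows = (
        p
        | "ReadMongo" >> mongodbio.ReadFromMongoDB(
            uri="mongodb://host:27017", db="forms", coll="form_datas")
        | "Transform" >> beam.Map(transform_row)
    )
    # Keep an NDJSON copy on GCS so the data can be replayed later
    _ = (rows
         | "ToJson" >> beam.Map(json.dumps)
         | "WriteGCS" >> beam.io.WriteToText(
             "gs://my-bucket/exports/form_datas", file_name_suffix=".ndjson"))
    # Load the transformed rows into the per-workflow BigQuery table
    _ = rows | "WriteBQ" >> beam.io.WriteToBigQuery(
        "my-project:forms.workflow_x",
        write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
        create_disposition=beam.io.BigQueryDisposition.CREATE_NEVER)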

Sharath Chandra

1 Answer


Typically, this sort of process writes raw data to GCS and then transforms it into BigQuery. This is done so that when you discover defects in the transform (which are inevitable) and the requirements change (also inevitable), you can replay the data with the new code.

Ideally, the steps prior to the transform are automated by a Change Data Capture (CDC) tool. There are plenty of CDC tools, but Debezium is taking over as it is reliable and free. There is a Debezium connector to get data from MongoDB, and there are examples of how to put Debezium CDC into BigQuery.

If you are going to write the code that puts data to GCS, I recommend considering Apache Parquet rather than NDJSON as the format. Performance and cost will be better, and I find a format with data types easier to work with.
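
For example, a minimal sketch with pyarrow (column names taken from the question's sample data; writing directly to gs:// additionally needs a GCS-capable filesystem such as gcsfs):

# Minimal sketch: writing a batch of transformed rows as Parquet instead of NDJSON.
import pyarrow as pa
import pyarrow.parquet as pq

rows = [
    {"email": "xyx@gmail.com", "dob": "2001-04-05"},
    {"email": "abc@example.com", "dob": "1999-12-31"},  # illustrative sample row
]

table = pa.Table.from_pylist(rows)           # columns carry real Arrow data types
pq.write_table(table, "form_datas.parquet")  # upload to GCS, or write to gs:// via a filesystem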

Steven Ensslen
  • Thanks. I get the overall direction you are suggesting. For now I might fall back on MongoDB Change Streams with Pub/Sub and Cloud Functions for CDC. With this approach I would be streaming the changes in near real-time. Are there any best practices on how the change data stream should be stored on GCS? E.g. is it a good idea to separate the data by timestamps (maybe daily) and then have downstream systems (Apache Beam) use the same to push to, say, BigQuery or other data warehouses? – Sharath Chandra Dec 01 '20 at 10:22
  • It looks like dbt might also be a good option to consider, by loading the entire raw data into BQ and then using dbt for transformation. – Sharath Chandra Dec 01 '20 at 11:25
  • @SharathChandra there are too many best practices for a Stack Overflow chat. My career is giving people advice in this area. I recommend reading some blogs and a book. – Steven Ensslen Dec 01 '20 at 21:12