
TL;DR: I'd like to combine the power of BigQuery with my MERN-stack application. Is it better to (a) use nodejs-bigquery to write a Node/Express API directly against BigQuery, or (b) create a daily job that writes my (entire) BigQuery DB over to MongoDB, and then use mongoose to write a Node/Express API against MongoDB?

I need to determine the best approach for combining a data ETL workflow that creates a BigQuery database with a React/Node web application. The data ETL uses Airflow to create a workflow that (a) backs up daily data into GCS, (b) writes that data to a BigQuery database, and (c) runs a bunch of SQL to create additional tables in BigQuery. It seems to me that my only two options are to:

  1. Do a daily write/convert/transfer/migrate (whatever the correct verb is) from the BigQuery database to MongoDB. I already have a node/express API written using mongoose and connected to a MongoDB cluster, and this approach would allow me to keep that API.
  2. Use the nodejs-bigquery library to create a Node API that is directly connected to BigQuery. My app would change from a MERN stack to a (BQ)ERN stack. I would have to re-write the node/express API to work with BigQuery, but I would no longer need MongoDB (nor have to transfer data daily from BigQuery to Mongo). However, BigQuery can be a very slow database when looking up a single entry, since it's not meant to be used like Mongo or a SQL database (it has no indexes, so a single-row retrieval runs as a full table scan). Most of my API calls ask for very little data from the database.

I am not sure which approach is best. I don't know if having 2 databases for 1 web application is a bad practice. I don't know if it's possible to do (1) with the daily transfers from one db to the other, and I don't know how slow BigQuery will be if I use it directly with my API. I think if it is easy to add (1) to my data engineering workflow, that this is preferred, but again, I am not sure.

Canovice
  • Possible? Yes. Try also asking on https://reddit.com/r/bigquery – since the question seems to have too large of a scope. – Felipe Hoffa Apr 20 '20 at 19:21
  • It is a bit wide in scope for a question. – Canovice Apr 20 '20 at 22:14
  • @Canovice: A few suggestions on this, as we have recently done something similar to transfer a DB from Postgres to MongoDB. You can write cron jobs to keep both databases in sync, so no human interaction is needed for the transfer. You also need to keep metrics on how much data is successfully transferred and handle all kinds of errors. Adding a reference from one DB to the other might be a big help in the future when tracking down bugs. – Nayan Apr 24 '20 at 19:34

2 Answers


I am going with (1). It shouldn't be too much work to write a Python script that queries tables from BigQuery, transforms the rows, and writes collections to Mongo. There are some things to handle (incremental changes, etc.), but this is much easier than writing a whole new Node/BigQuery API.
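
For what it's worth, a minimal sketch of what that script could look like, assuming the google-cloud-bigquery and pymongo packages. The connection string, table and collection names, and the transform step are placeholders for your own setup, and it rebuilds each collection wholesale rather than handling incremental changes:

    # Rough sketch: daily BigQuery -> MongoDB transfer.
    # Placeholders: the Mongo URI, database/collection names, and the table list.
    from google.cloud import bigquery
    from pymongo import MongoClient

    bq = bigquery.Client()
    mongo = MongoClient("mongodb+srv://user:password@cluster.example.mongodb.net")
    db = mongo["my_app_db"]  # hypothetical database name

    def transform(row: dict) -> dict:
        # Reshape a BigQuery row into the document shape the mongoose models
        # expect (also the place to convert types Mongo can't store directly).
        return dict(row)

    def transfer_table(bq_table: str, mongo_collection: str) -> None:
        rows = bq.query(f"SELECT * FROM `{bq_table}`").result()
        docs = [transform(dict(row)) for row in rows]
        if docs:
            # Rebuild the collection from scratch; BigQuery stays the source of truth.
            db[mongo_collection].delete_many({})
            db[mongo_collection].insert_many(docs)

    def transfer_all_tables() -> None:
        # Hypothetical table-to-collection mapping.
        for table, collection in [("my_project.my_dataset.users", "users")]:
            transfer_table(table, collection)

    if __name__ == "__main__":
        transfer_all_tables()

Dropping and re-inserting keeps Mongo fully re-creatable from BigQuery; if the tables are large, an incremental upsert keyed on an ID column would be the next refinement.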

Canovice

FWIW, in a past life I worked on a web e-commerce site that had 4 different DB back ends (Mongo, MySQL, Redis, Elasticsearch), so more than one is not an issue at all, but you need to consider one of them the DB of record, i.e. if anything does not match between them, one is the source of truth and the other is suspect. In my example, Redis and Elasticsearch were nearly ephemeral: blow them away and they get recreated from the underlying MySQL and Mongo sources. Now, MySQL and Mongo at the same time was a bit odd, in that we were doing a slow-roll migration, meaning various record types were being transitioned from MySQL over to Mongo. That process looked a bit like:

  - The ORM layer writes to both MySQL and Mongo; reads still come from MySQL.
  - Data is regularly compared.
  - A few months elapse with no irregularities, then writes to MySQL are turned off and reads are moved to Mongo.

The end goal was no more MySQL; everything was Mongo. I ran down that tangent because it seems like you could do something similar: write to both DBs in whatever DB abstraction layer you use (ORM, DAO, other things I don't keep up to date with, etc.) and eventually move the reads, as appropriate, to wherever they need to go. If you need large batches for writes, you could buffer at that abstraction layer until a threshold of your choosing is reached before sending them.

With all that said, depending on your data complexity, a nightly ETL job would be completely doable as well, but you do run into the extra complexity of managing and monitoring that additional process. Another potential downside is that the data is always stale by a day.
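
If it helps, here is a rough sketch of how that nightly job could hang off the Airflow workflow the question already describes. The DAG id, schedule, and the transfer_all_tables helper (a wrapper like the script sketched in the other answer) are assumptions, and the import path assumes Airflow 2.x:

    # Rough sketch: run the BigQuery -> Mongo transfer as a daily Airflow task.
    # Assumptions: Airflow 2.x import paths and a hypothetical my_etl.transfer module.
    from datetime import datetime

    from airflow import DAG
    from airflow.operators.python import PythonOperator

    from my_etl.transfer import transfer_all_tables  # hypothetical wrapper around the sync script

    with DAG(
        dag_id="bigquery_to_mongo_sync",
        start_date=datetime(2020, 1, 1),
        schedule_interval="@daily",
        catchup=False,
    ) as dag:
        PythonOperator(
            task_id="sync_mongo_from_bigquery",
            python_callable=transfer_all_tables,
        )

In practice you would chain this after the task that finishes the BigQuery SQL, so Mongo is only rebuilt from completed tables, and Airflow's retries and alerting cover much of the monitoring concern mentioned above.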

Jared Chmielecki
  • Great insights, thank you! My BigQuery database is going to be the single source of truth. The goal is for the Mongo database to be completely re-creatable from the BigQuery database. – Canovice May 01 '20 at 21:00