
We have an application running on Google App Engine that uses Datastore as its persistence back-end. Currently the application has mostly 'OLTP' features and some rudimentary reporting. While implementing reports, we found that processing large amounts of data (millions of objects) is very difficult with Datastore and GQL. To enhance our application with proper reporting and business-intelligence features, we think it is better to set up an ETL process that moves data from Datastore to BigQuery.

Initially we thought of implementing the ETL process as an App Engine cron job, but it looks like Dataflow can also be used for this. We have the following requirements for setting up the process:

  • Be able to push all existing data to BigQuery using BigQuery's non-streaming (load job) API.
  • Once the above is done, push any new data to BigQuery using the streaming API whenever it is created or updated in Datastore. (The difference between the two ingestion modes is sketched below.)
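
For reference, a minimal sketch of the two BigQuery ingestion modes in question, using today's google-cloud-bigquery Java client; the dataset, table, Cloud Storage path, and row fields are hypothetical placeholders:

```java
import com.google.cloud.bigquery.BigQuery;
import com.google.cloud.bigquery.BigQueryOptions;
import com.google.cloud.bigquery.FormatOptions;
import com.google.cloud.bigquery.InsertAllRequest;
import com.google.cloud.bigquery.InsertAllResponse;
import com.google.cloud.bigquery.JobInfo;
import com.google.cloud.bigquery.LoadJobConfiguration;
import com.google.cloud.bigquery.TableId;

import java.util.HashMap;
import java.util.Map;

public class BigQueryIngestionModes {
  public static void main(String[] args) throws InterruptedException {
    BigQuery bigquery = BigQueryOptions.getDefaultInstance().getService();
    TableId table = TableId.of("warehouse", "entities"); // hypothetical

    // Non-streaming: a batch load job from files staged in Cloud Storage.
    // No streaming charges; suited to the one-time backfill of existing data.
    LoadJobConfiguration load =
        LoadJobConfiguration.newBuilder(table, "gs://my-bucket/export/*.json")
            .setFormatOptions(FormatOptions.json())
            .build();
    bigquery.create(JobInfo.of(load)).waitFor(); // block until the load finishes

    // Streaming: rows become queryable within seconds, but inserts are billed.
    // Suited to the ongoing create/update feed.
    Map<String, Object> row = new HashMap<>();
    row.put("id", 42L);
    row.put("name", "example");
    InsertAllResponse response =
        bigquery.insertAll(InsertAllRequest.newBuilder(table).addRow(row).build());
    if (response.hasErrors()) {
      System.err.println("Streaming insert errors: " + response.getInsertErrors());
    }
  }
}
```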

My questions are:

  1. Is Cloud Dataflow the right candidate for implementing this pipeline?
  2. Will we be able to push the existing data? Some of the kinds have millions of objects.
  3. What is the right approach to implementing it? We are considering two approaches. The first is to go through Pub/Sub: for existing data, create a cron job that pushes all of it to Pub/Sub, and for any new data, push it to Pub/Sub at the same time it is written to Datastore; a Dataflow pipeline then picks it up from Pub/Sub and pushes it to BigQuery (see the sketch after these questions). The second approach is to create a batch pipeline in Dataflow that queries Datastore and pushes any new data to BigQuery.

Are these two approaches doable? Which one is better cost-wise? Is there any other way that is better than these two?
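
For illustration, here is roughly what the streaming half of the first approach could look like with the Apache Beam Java SDK (the successor of the Dataflow SDK current at the time). The topic and table names are hypothetical, the payload is assumed to be published as JSON, and the destination table is assumed to already exist; a real pipeline would parse the payload into typed columns instead of storing it raw:

```java
import com.google.api.services.bigquery.model.TableRow;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
import org.apache.beam.sdk.io.gcp.pubsub.PubsubIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.values.TypeDescriptor;

public class StreamingDatastoreUpdatesToBigQuery {
  public static void main(String[] args) {
    Pipeline p = Pipeline.create(
        PipelineOptionsFactory.fromArgs(args).withValidation().create());

    p.apply("ReadUpdates",
            PubsubIO.readStrings()
                .fromTopic("projects/my-project/topics/datastore-updates"))
     // Store the raw JSON payload in a single STRING column named "payload".
     .apply("ToTableRow",
            MapElements.into(TypeDescriptor.of(TableRow.class))
                .via((String json) -> new TableRow().set("payload", json)))
     .apply("WriteToBigQuery",
            BigQueryIO.writeTableRows()
                .to("my-project:warehouse.entities")
                .withMethod(BigQueryIO.Write.Method.STREAMING_INSERTS)
                .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_NEVER)
                .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND));

    p.run();
  }
}
```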

Thank you,

rizTaak


1 Answer


Dataflow can absolutely be used for this purpose. In fact, Dataflow's scalability should make the process fast and relatively easy.

Both of your approaches should work -- I'd give preference to the second one: use a batch pipeline to move the existing data, and then a streaming pipeline to handle new data via Cloud Pub/Sub (a sketch of the batch half follows). In addition to the data movement, Dataflow allows arbitrary analytics/manipulation to be performed on the data itself.
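
To make that concrete, here is a minimal sketch of the batch backfill, written against the current Apache Beam Java SDK rather than the Dataflow SDK of the time; the project, kind, property, and table names are hypothetical. Because the input is bounded, BigQueryIO writes via load jobs by default, i.e. the non-streaming path the question asks for:

```java
import com.google.api.services.bigquery.model.TableFieldSchema;
import com.google.api.services.bigquery.model.TableRow;
import com.google.api.services.bigquery.model.TableSchema;
import com.google.datastore.v1.Entity;
import com.google.datastore.v1.KindExpression;
import com.google.datastore.v1.Query;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
import org.apache.beam.sdk.io.gcp.datastore.DatastoreIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;

import java.util.Arrays;

public class BackfillDatastoreToBigQuery {

  // Assumes every entity of the kind has a string property "name".
  static class EntityToRowFn extends DoFn<Entity, TableRow> {
    @ProcessElement
    public void processElement(ProcessContext c) {
      Entity e = c.element();
      c.output(new TableRow()
          .set("name", e.getPropertiesMap().get("name").getStringValue()));
    }
  }

  public static void main(String[] args) {
    Pipeline p = Pipeline.create(
        PipelineOptionsFactory.fromArgs(args).withValidation().create());

    // Read every entity of one kind; the runner splits the query for parallelism.
    Query query = Query.newBuilder()
        .addKind(KindExpression.newBuilder().setName("MyKind"))
        .build();

    TableSchema schema = new TableSchema().setFields(Arrays.asList(
        new TableFieldSchema().setName("name").setType("STRING")));

    p.apply("ReadFromDatastore",
            DatastoreIO.v1().read()
                .withProjectId("my-project")
                .withQuery(query))
     .apply("EntityToRow", ParDo.of(new EntityToRowFn()))
     .apply("WriteToBigQuery",
            BigQueryIO.writeTableRows()
                .to("my-project:warehouse.mykind")
                .withSchema(schema)
                .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_IF_NEEDED)
                .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_TRUNCATE));

    p.run();
  }
}
```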

That said, BigQuery and Datastore can also be connected directly, without Dataflow. See, for example, Loading Data From Cloud Datastore in the BigQuery documentation.
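
For completeness, a sketch of that direct route, assuming a Datastore backup has already been exported to Cloud Storage (the backup path is a hypothetical placeholder; BigQuery infers the schema from the backup file):

```java
import com.google.cloud.bigquery.BigQuery;
import com.google.cloud.bigquery.BigQueryOptions;
import com.google.cloud.bigquery.FormatOptions;
import com.google.cloud.bigquery.JobInfo;
import com.google.cloud.bigquery.LoadJobConfiguration;
import com.google.cloud.bigquery.TableId;

public class LoadDatastoreBackup {
  public static void main(String[] args) throws InterruptedException {
    BigQuery bigquery = BigQueryOptions.getDefaultInstance().getService();
    // Load the exported backup of one kind straight into a BigQuery table.
    LoadJobConfiguration config = LoadJobConfiguration
        .newBuilder(TableId.of("warehouse", "mykind"),
                    "gs://my-bucket/backup/output.MyKind.backup_info")
        .setFormatOptions(FormatOptions.datastoreBackup())
        .build();
    bigquery.create(JobInfo.of(config)).waitFor();
  }
}
```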

Davor Bonaci
  • 1,709
  • 8
  • 9
  • The "automatic" solution using the Datastore Backup tool relies on the Files API which will be shut down in a few days. Today it was shut down temporarily. So it seems Dataflow will be the go_to solution to implement this. I see the examples provide info for writing to BigQuery, but where should we look for an example on reading from Datastore? I found the documentation and will work from that, but a full example of Datastore -> BigQuery using Dataflow would be really helpful. Thanks! – Zebs Aug 27 '15 at 20:59
  • An example reading from Datastore can be found in our [`cookbook`](https://github.com/GoogleCloudPlatform/DataflowJavaSDK-examples/blob/master/src/main/java/com/google/cloud/dataflow/examples/cookbook/DatastoreWordCount.java) examples. It should be straightforward to pair this with a BigQuery example. – Davor Bonaci Aug 28 '15 at 17:45