0

We currently upload the data retrieved from vendor APIs into Google Datastore. I would like to know the best approach for storing and querying this data.

I will need to query millions of rows and extract custom-engineered features from the data. So I am wondering whether I should load the data into BigQuery directly and query it there for faster processing, or store it in Datastore and then move it to BigQuery for querying. I will be using pandas to compute statistics on the stored data.

Dan McGrath
user845405

3 Answers

5

In general, Google Cloud Datastore is used for storing user data, accessed by applications. Google BigQuery is used for running analytical queries on data, so it sounds better suited to your proposed use case.

You can see the Google Cloud storage options table for a more detailed comparison.

Loading Datastore data directly into BigQuery will give you the best query performance, but you could also back up your Datastore to Cloud Storage and use Cloud Storage as an external data source for BigQuery.
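As a rough illustration of the direct-load route, here is a minimal sketch using the `google-cloud-bigquery` client. It assumes you have already exported a Datastore kind to Cloud Storage; the function name, the URI, and the table ID are placeholders of mine, not anything from your project.

```python
def load_datastore_export(export_uri, table_id):
    """Load one exported Datastore kind into a BigQuery table.

    export_uri: gs:// path to the kind's .export_metadata file (placeholder).
    table_id:   destination table as "project.dataset.table" (placeholder).
    Returns the number of rows in the destination table after the load.
    """
    # Fail early on an obviously wrong path, before touching the network.
    if not (export_uri.startswith("gs://")
            and export_uri.endswith(".export_metadata")):
        raise ValueError(
            "expected a gs:// URI ending in .export_metadata, got: "
            + export_uri
        )

    # Imported here so the check above works without the library installed.
    from google.cloud import bigquery

    client = bigquery.Client()  # uses your default credentials and project
    job_config = bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.DATASTORE_BACKUP
    )
    load_job = client.load_table_from_uri(
        export_uri, table_id, job_config=job_config
    )
    load_job.result()  # block until the load job finishes
    return client.get_table(table_id).num_rows
```

You would call it with something like `load_datastore_export("gs://my-bucket/.../kind_Vendor.export_metadata", "my-project.my_dataset.vendor")`, where both arguments are made up for the example.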

To access BigQuery results in pandas, you can use the pandas-gbq library or the BigQuery integration with Datalab.
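For the pandas-gbq route, a minimal sketch might look like the following. The table and column names (`vendor_id`, `price`, `event_date`) are invented for illustration, and the query interpolates its arguments directly into the SQL, so it assumes they come from trusted constants rather than user input.

```python
def build_feature_query(table, since):
    """Build an aggregate query so BigQuery does the heavy lifting.

    table: fully qualified table name (placeholder).
    since: ISO date string used as a lower bound (placeholder).
    Column names are hypothetical; adjust them to your schema.
    """
    return (
        "SELECT vendor_id, COUNT(*) AS n_rows, AVG(price) AS avg_price "
        f"FROM `{table}` "
        f"WHERE event_date >= '{since}' "
        "GROUP BY vendor_id"
    )


def load_features(project_id, table, since):
    """Run the query and return the result as a pandas DataFrame."""
    # Imported here so build_feature_query works without pandas-gbq.
    import pandas_gbq

    return pandas_gbq.read_gbq(
        build_feature_query(table, since),
        project_id=project_id,
        dialect="standard",
    )
```

Aggregating in SQL first means only the per-vendor summary is downloaded, so pandas never has to hold the millions of raw rows; `load_features("my-project", "my_dataset.vendor_events", "2018-01-01").describe()` would then give you summary statistics on the result (all three arguments are placeholders).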

Tim Swast
0

As far as I can tell, there is no support for Datastore in pandas. This might affect your decision.

0

You may also want to consider the daily quota on INSERT/DELETE operations, which is 1000 for BigQuery versus 20000 for Datastore (at the time of this writing). See references below:

On top of that, UPSERTs or modifications of existing rows do not appear to be a recommended operation in BigQuery:

These limits give you another angle from which to make your decision.

--The following is just my personal experience--

I faced a similar choice, but after learning about these quotas I got the impression that BigQuery may not always be suited to serve as a data lake. Instead, you can load the data into Datastore first and later load the part you need for analysis into BigQuery, as @tim-swast mentioned:

Junji Shimagaki