
We are trying to insert data into BigQuery (streaming) using Dataflow. Is there a way we can keep a check on the number of records inserted into BigQuery? We need this data for reconciliation purposes.

dijeah
  • Can you call an API? If so, you can call [Tables.get](https://cloud.google.com/bigquery/docs/reference/rest/v2/tables/get) and see the table metadata, or run a query on __TABLES__ with the tableId. Let me know if this is possible. – Tamir Klein Mar 26 '19 at 12:41
  • I will try this approach, but I'm afraid that with streaming data going into BQ, the query may give a number we are not able to validate. We may need to create a window, though. – dijeah Mar 26 '19 at 12:56
  • If you simply query the table with `select count(*)..`, would this give you the answer you're looking for? The streaming buffer is also queried in the process. – Graham Polley Mar 26 '19 at 13:11
  • For reconciliation, I need an initial count and a final count. I can get the final count written to BQ using count(*), but I need an initial value to compare that count with. That initial value has to come from the pipeline, which will count how many records have been received that will be written to BQ. – dijeah Mar 26 '19 at 13:44
  • Do you need to do these counts inside your Dataflow pipeline? – Graham Polley Mar 26 '19 at 21:40

2 Answers


Add a step to your Dataflow pipeline that calls the Google API Tables.get, or run this query before and after the flow (both are equally good):

```sql
select row_count, table_id from `dataset.__TABLES__` where table_id = 'audit'
```

As an example, the query returns output like this:

(screenshot of sample output showing the table's row_count and table_id)
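
If you want to do this programmatically, here is a minimal sketch of both options using the google-cloud-bigquery Python client; the project, dataset, and table names are placeholders, not from the original answer.

```python
from google.cloud import bigquery

client = bigquery.Client()

# Option A: table metadata, the programmatic equivalent of Tables.get.
# "my-project.my_dataset.audit" is a placeholder table reference.
table = client.get_table("my-project.my_dataset.audit")
print("row_count from metadata:", table.num_rows)

# Option B: query __TABLES__ before and after the pipeline run
# and compare the two row_count values.
query = """
    SELECT row_count, table_id
    FROM `my_dataset.__TABLES__`
    WHERE table_id = 'audit'
"""
for row in client.query(query).result():
    print(row.table_id, row.row_count)
```

Note that for streaming inserts, rows still in the streaming buffer may not yet be reflected in these metadata counts, whereas a `select count(*)` also scans the streaming buffer (as mentioned in the comments above).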

Tamir Klein

You may also be able to check the "Elements added" count by clicking on the step that writes to BigQuery in the Dataflow UI.
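
If you also need that pipeline-side count programmatically (the "initial value" mentioned in the comments), a custom Beam metrics counter placed just before the BigQuery write will show up under that step's counters in the Dataflow UI. A minimal sketch with the Python SDK; the step, dataset, and table names are placeholders:

```python
import apache_beam as beam
from apache_beam.metrics import Metrics


class CountRecords(beam.DoFn):
    """Counts every element flowing into the BigQuery write step."""

    def __init__(self):
        super().__init__()
        self.records = Metrics.counter(self.__class__, 'records_sent_to_bq')

    def process(self, element):
        self.records.inc()
        yield element


# Hypothetical usage inside a pipeline:
# (p
#  | 'ReadSource' >> ...
#  | 'CountBeforeWrite' >> beam.ParDo(CountRecords())
#  | 'WriteToBQ' >> beam.io.WriteToBigQuery('my_dataset.audit'))
```

The counter value can then be compared against the difference between the before/after row counts from the other answer.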

Alex Amato