
I have a Dataflow pipeline that reads messages from Pub/Sub Lite and streams the data into a BigQuery table. The table is partitioned by day. When querying the table with:

SELECT * FROM `my-project.my-dataset.my-table` WHERE DATE(timestamp) = "2021-10-14"

The BigQuery UI tells me "This query will process 1.9 GB when run". But when actually running the query, I don't get any results. My pipeline has been running for a whole week now, and I have been getting the same empty result for the last two days. For 2021-10-11 and the days before that, however, I am seeing actual results.

I am currently using Apache Beam version 2.26 and my Dataflow writer looks like this:

return BigQueryIO.<Event>write()
    .withSchema(createTableSchema())
    .withFormatFunction(event -> createTableRow(event))
    .withCreateDisposition(CreateDisposition.CREATE_NEVER)
    .withWriteDisposition(WriteDisposition.WRITE_APPEND)
    .withTimePartitioning(new TimePartitioning().setType("DAY").setField("timestamp"))
    .to(TABLE);

Why is BigQuery taking so long to commit the values to the partitions, while at the same time telling me that there is data available?

EDIT 1:

[screenshot]

[screenshot]

  • If your table were really created partitioned and there is no data for this date, the query editor should tell you that 0 B will be processed; this is how partitions work. Based on your description, it seems the table partitioning is not working properly. Please open your table in BigQuery and check Details -> Table info, where you should see the fields Table Type: Partitioned, Partitioned by: DAY. Please also provide an example of your data and the data schema (with sensitive information removed) so that I can try to reproduce. – ewertonvsilva Oct 14 '21 at 14:06
  • Additionally, post a message to the topic with data for your table and show the output of the Dataflow job; it has to say whether the data was written or whether an error happened. `gcloud pubsub topics publish test-topic --message='{"ts": "2021-10-25 00:00:00 UTC","storeName":"zabka","itemId": "3","itemName": "New line from dataflow"}'` – ewertonvsilva Oct 14 '21 at 14:07

1 Answer


BigQuery is processing data but not returning any rows because it is also processing the data in your streaming buffer. Data in the buffer can take up to 90 minutes to be committed to partitioned tables.

You can find more details in this related Stack Overflow question and in the BigQuery documentation.

When streaming to a partitioned table, data in the streaming buffer has a NULL value for the _PARTITIONTIME pseudo column.

If you are having problems writing the data from Pub/Sub into BigQuery, I recommend using one of the Dataflow templates available in GCP:

There is a template that writes data from a Pub/Sub topic to BigQuery, and it already takes care of the possible corner cases.

I tested it as follows and it works perfectly:

  • Create a subscription on your Pub/Sub topic;
  • Create a bucket for temporary storage;
  • Create the job as follows: [screenshot]
  • For testing, I just sent a message to the topic in JSON format, and the new data was added to the output table:

gcloud pubsub topics publish test-topic --message='{"field_dt": "2021-10-15T00:00:00","field_ts": "2021-10-15 00:00:00 UTC","item": "9999"}'


If you want something more complex, you can fork the templates' code from GitHub and adjust it to your needs.

  • Thank you for your help. I think it is working now, but I am not completely sure. My hypothesis is that the problem comes from a slowly processing Dataflow pipeline, which has not yet been able to process the messages from the last couple of days. The information I am seeing in BigQuery is hence a little misleading, because the pipeline hasn't written any messages for those days yet. I don't know why the BigQuery UI is showing me that there is already data, or why that number keeps increasing. – p13rr0m Oct 18 '21 at 13:46