For context
We currently have a data mining script that collects a large volume of event data and publishes it to a Pub/Sub topic with schema enforcement (all fields in the schema are optional). A Pub/Sub-to-BigQuery subscription then writes the data to a destination table.
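For reference, the subscription side is wired up roughly like the sketch below (the project, topic, and table names are placeholders, not our real resources):

```python
from google.cloud import pubsub_v1

project_id = "my-project"          # placeholder
topic_id = "events-topic"          # schema-enforced topic, all fields optional
subscription_id = "events-to-bq"   # BigQuery subscription

publisher = pubsub_v1.PublisherClient()
subscriber = pubsub_v1.SubscriberClient()
topic_path = publisher.topic_path(project_id, topic_id)
subscription_path = subscriber.subscription_path(project_id, subscription_id)

# BigQuery subscription that writes messages straight into the destination table,
# using the topic schema to map message fields to table columns.
subscriber.create_subscription(
    request={
        "name": subscription_path,
        "topic": topic_path,
        "bigquery_config": {
            "table": "my-project.my_dataset.events",  # placeholder destination table
            "use_topic_schema": True,
        },
    }
)
```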
The problem
At present, the script fetching the data sometimes also picks up events that do not conform to the schema. To avoid downtime, the script has been configured to ingest these erroneous messages anyway and send them to the destination table as a single field containing the message contents as a JSON string, while the rest of the fields in that row are null. This obviously poses a data quality issue, since end users shouldn't have to filter through so many null values. However, since the schema failure could have been caused by a variety of fields, I need some way to retain the raw JSON string and monitor which events are causing trouble.
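The fallback in the script is roughly equivalent to the following (the `raw_json` column name and the `conforms_to_schema` helper are illustrative, not our actual code):

```python
import json

# Columns of the destination table; all are NULLABLE.
EXPECTED_FIELDS = ["event_id", "event_type", "timestamp", "payload"]  # illustrative

def to_row(event: dict) -> dict:
    """Map an event onto the destination schema; on failure, dump it into raw_json."""
    if conforms_to_schema(event):  # hypothetical validation helper
        return {field: event.get(field) for field in EXPECTED_FIELDS} | {"raw_json": None}
    # Non-conforming event: every typed column is null, only raw_json is populated.
    return {field: None for field in EXPECTED_FIELDS} | {"raw_json": json.dumps(event)}
```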
Potential Solution
To move toward a cleaner approach, I have been exploring the use of a dead-letter topic to which these erroneous messages are sent. The idea is to have two separate BigQuery tables:
- One corresponding to a stricter policy that filters out errors before they reach the destination table.
- One which retains only the messages from the dead-letter topic (storing the unacknowledged messages in a GCS bucket would also be an option).
This would avoid the data quality issue while still retaining information about the errors in a more cost-efficient way. A rough sketch of the configuration I have in mind is below.
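Concretely, based on my reading of the dead-letter documentation, I imagine something along these lines (all resource names are placeholders, and the retry count and permission details are assumptions on my part):

```python
from google.cloud import pubsub_v1

project_id = "my-project"  # placeholder
publisher = pubsub_v1.PublisherClient()
subscriber = pubsub_v1.SubscriberClient()

main_topic = publisher.topic_path(project_id, "events-topic")
dead_letter_topic = publisher.topic_path(project_id, "events-dead-letter")

# 1) Strict BigQuery subscription on the main topic: messages that cannot be
#    written to the clean table are retried and eventually forwarded to the
#    dead-letter topic. (The Pub/Sub service agent also needs publisher rights
#    on the dead-letter topic for this to work.)
subscriber.create_subscription(
    request={
        "name": subscriber.subscription_path(project_id, "events-to-bq-strict"),
        "topic": main_topic,
        "bigquery_config": {
            "table": "my-project.my_dataset.events_clean",
            "use_topic_schema": True,
        },
        "dead_letter_policy": {
            "dead_letter_topic": dead_letter_topic,
            "max_delivery_attempts": 5,
        },
    }
)

# 2) Second BigQuery subscription on the dead-letter topic, landing each failed
#    message body in a separate errors table for monitoring.
subscriber.create_subscription(
    request={
        "name": subscriber.subscription_path(project_id, "dead-letter-to-bq"),
        "topic": dead_letter_topic,
        "bigquery_config": {
            # No topic schema here: the message body goes into the table's `data` column.
            "table": "my-project.my_dataset.events_errors",
        },
    }
)
```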
Question
Is what I am trying to achieve possible? I have explored Google's documentation on dead-letter topics, but it doesn't go into detail on how the unacknowledged messages can be retrieved afterwards.