
In this document, Apache Beam suggests the dead-letter pattern when writing to BigQuery. This pattern lets you fetch the rows that failed to be written from the transform output under the 'FailedRows' tag.
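
Roughly, my understanding of the documented usage is something like this (my own sketch; the table and schema below are placeholders, and it is shown with streaming inserts, where the tag is populated):

# Sketch only: pcoll is a PCollection of dicts; table/schema names are placeholders.
write_result = (
    pcoll
    | WriteToBigQuery(
        table="my_project:my_dataset.my_table",
        schema="some_field_name:INTEGER",
        method=WriteToBigQuery.Method.STREAMING_INSERTS,
    )
)
# Rows that BigQuery rejected are emitted under the 'FailedRows' tag.
failed_rows = write_result[BigQueryWriteFn.FAILED_ROWS]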

However, when I try to use it:

from apache_beam.io.gcp.bigquery import WriteToBigQuery
from apache_beam.io.gcp.bigquery_tools import FileFormat

WriteToBigQuery(
    table=self.bigquery_table_name,
    schema={"fields": self.bigquery_table_schema},
    method=WriteToBigQuery.Method.FILE_LOADS,
    temp_file_format=FileFormat.AVRO,
)

A schema mismatch in one of my elements causes the following exception:

Error message from worker: Traceback (most recent call last):
  File "/my_code/apache_beam/io/gcp/bigquery_tools.py", line 1630, in write
    self._avro_writer.write(row)
  File "fastavro/_write.pyx", line 647, in fastavro._write.Writer.write
  File "fastavro/_write.pyx", line 376, in fastavro._write.write_data
  File "fastavro/_write.pyx", line 320, in fastavro._write.write_record
  File "fastavro/_write.pyx", line 374, in fastavro._write.write_data
  File "fastavro/_write.pyx", line 283, in fastavro._write.write_union
ValueError: [] (type <class 'list'>) do not match ['null', 'double'] on field safety_proxy

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "apache_beam/runners/common.py", line 1198, in apache_beam.runners.common.DoFnRunner.process
  File "apache_beam/runners/common.py", line 718, in apache_beam.runners.common.PerWindowInvoker.invoke_process
  File "apache_beam/runners/common.py", line 841, in apache_beam.runners.common.PerWindowInvoker._invoke_process_per_window
  File "apache_beam/runners/common.py", line 1334, in apache_beam.runners.common._OutputProcessor.process_outputs
  File "/my_code/apache_beam/io/gcp/bigquery_file_loads.py", line 258, in process
    writer.write(row)
  File "/my_code/apache_beam/io/gcp/bigquery_tools.py", line 1635, in write
    ex, self._avro_writer.schema, row)).with_traceback(tb)
  File "/my_code/apache_beam/io/gcp/bigquery_tools.py", line 1630, in write
    self._avro_writer.write(row)
  File "fastavro/_write.pyx", line 647, in fastavro._write.Writer.write
  File "fastavro/_write.pyx", line 376, in fastavro._write.write_data
  File "fastavro/_write.pyx", line 320, in fastavro._write.write_record
  File "fastavro/_write.pyx", line 374, in fastavro._write.write_data
  File "fastavro/_write.pyx", line 283, in fastavro._write.write_union
ValueError: Error writing row to Avro: [] (type <class 'list'>) do not match ['null', 'double'] on field safety_proxy Schema: ...

From what I gather, the schema mismatch causes fastavro._write.Writer.write to fail and throw an exception. Instead, I would like WriteToBigQuery to apply the dead-letter behavior and return my malformed rows as FailedRows-tagged output. Is there a way to achieve this?

Thanks

EDIT: Adding a more detailed example of what I'm trying to do:

from apache_beam import Create
from apache_beam.io.gcp.bigquery import BigQueryWriteFn, WriteToBigQuery
from apache_beam.io.gcp.bigquery_tools import FileFormat
from apache_beam.io.textio import WriteToText

...

valid_rows = [{"some_field_name": i} for i in range(1000000)]
invalid_rows = [{"wrong_field_name": i}]

pcoll = pipeline | Create(valid_rows + invalid_rows)

# This fails because of the 1 invalid row
write_result = (
    pcoll 
    |  WriteToBigQuery(
        table=self.bigquery_table_name,
        schema={
            "fields": [
                {'name': 'some_field_name', 'type': 'INTEGER', 'mode': 'NULLABLE'},
            ]
        },
        method=WriteToBigQuery.Method.FILE_LOADS,
        temp_file_format=FileFormat.AVRO,
    )
)

# What I want is for WriteToBigQuery to partially succeed and output the failed rows.
# This is because I have pipelines that run for multiple hours and fail because of
# a small number of malformed rows.
(
    write_result[BigQueryWriteFn.FAILED_ROWS] 
    | WriteToText('gs://my_failed_rows/')
)
cozos
  • Can you show us more code please? How do you write the final output to BigQuery, and how do you try to recover the `FailedRows` PCollection and write it to BigQuery? – Mazlum Tosun Sep 03 '22 at 08:35
  • Edited with more detailed code – cozos Sep 03 '22 at 09:14
  • Thanks. Two questions: have you tried without the `Avro` temp file format? And can you detect the bad lines yourself in the pipeline instead of letting the BigQuery IO catch the errors for you? If so, I can propose a solution based on TupleTags and a dead-letter queue in the pipeline rather than in the BigQuery IO. – Mazlum Tosun Sep 03 '22 at 10:59
  • Hi, thanks for the reply. I use Avro for performance reasons - GCP docs say that Avro is the preferred and fastest way of batch loading files: https://cloud.google.com/bigquery/docs/loading-data-cloud-storage-avro#advantages_of_avro. When I do use JSON, the job succeeds at the write-temporary-files stage but fails in the LoadJob due to bad rows / schema mismatch. – cozos Sep 03 '22 at 17:41
  • I have tried to pre-validate on my side using fastavro.validate (fastavro is the library that Apache Beam uses), but it is very expensive performance-wise and uses about 15X more CPU. – cozos Sep 03 '22 at 17:47
  • Are your bad lines only detectable by a fastavro validation? I was thinking of a function that detects whether the current element is a bad line. If it is, you can raise an exception and catch the error via a TupleTag and a dead-letter queue in the pipeline. If you confirm that's possible, I will propose an elegant solution. – Mazlum Tosun Sep 03 '22 at 22:03
  • My understanding is that the "thorough" way to validate an Avro row is by using `fastavro.validate`. I guess there are other ways for a row to be bad (i.e. null/None, wrong type, etc.), but fastavro.validate will catch all of those. Am I understanding this wrong? Nevertheless, I'd be interested to see the TupleTag suggestion. Thanks! – cozos Sep 05 '22 at 06:37

2 Answers


You can use a dead-letter queue in the pipeline instead of letting BigQuery catch the errors for you. Beam offers a native way to do error handling and dead-letter queues with TupleTags, but the code is a little verbose.

I created an open-source library called Asgarde, for the Python SDK and the Java SDK, that applies error handling with less and more concise, expressive code:

https://github.com/tosun-si/pasgarde

(also the Java version : https://github.com/tosun-si/asgarde)

You can install it with pip:

pip install asgarde==0.16.0

from typing import Dict

import apache_beam as beam
from apache_beam import Create
from apache_beam.io.gcp.bigquery import BigQueryWriteFn, WriteToBigQuery
from apache_beam.io.textio import WriteToText
from asgarde.collection_composer import CollectionComposer

def validate_row(self, row) -> Dict:
    field = row['your_field']

    if field is None or field == '':
        # You can raise your own custom exception
        raise ValueError('Bad field')

    # Return the element so it flows on to the output PCollection.
    return row

...

valid_rows = [{"some_field_name": i} for i in range(1000000)]
invalid_rows = [{"wrong_field_name": i}]

pcoll = pipeline | Create(valid_rows + invalid_rows)

# Dead-letter queue proposed by Asgarde: it returns the output PCollection
# and the Failure PCollection.
output_pcoll, failure_pcoll = (CollectionComposer.of(pcoll)
                               .map(self.validate_row))

# Good sink
(
    output_pcoll
    | WriteToBigQuery(
        table=self.bigquery_table_name,
        schema={
            "fields": [
                {'name': 'some_field_name', 'type': 'INTEGER', 'mode': 'NULLABLE'},
            ]
        },
        method=WriteToBigQuery.Method.FILE_LOADS
    )
)

# Bad sink: PCollection[Failure] / a Failure carries the input element and
# the exception.
(
    failure_pcoll
    | beam.Map(lambda failure: self.your_failure_transformation(failure))
    | WriteToBigQuery(
        table=self.bigquery_failure_table_name,  # placeholder: a separate failure table
        schema=your_schema_for_failure_table,
        method=WriteToBigQuery.Method.FILE_LOADS
    )
)

The structure of the Failure object proposed by the Asgarde lib:

@dataclass
class Failure:
    pipeline_step: str
    input_element: str
    exception: Exception
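
For example, the your_failure_transformation used in the bad sink above could simply flatten a Failure into a Dict matching your failure table schema (the field names below are only an illustration):

def your_failure_transformation(self, failure: Failure) -> Dict:
    # Flatten the Failure object into a BigQuery row (illustrative field names).
    return {
        'pipeline_step': failure.pipeline_step,
        'input_element': failure.input_element,
        'exception': str(failure.exception),
    }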

In the validate_row function, you apply your validation logic and detect bad fields. If a row is bad, raise an exception and Asgarde will catch the error for you.
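
If you want the validation to mirror exactly what the Avro writer will accept (as discussed in the comments, this is heavier than a targeted field check), a sketch of validate_row based on fastavro could look like this, assuming the corresponding Avro schema is available as self.avro_schema:

from fastavro.validation import validate

def validate_row(self, row) -> Dict:
    # Raises fastavro.validation.ValidationError if the row does not match the
    # Avro schema; Asgarde then routes the element to the Failure PCollection.
    validate(row, self.avro_schema)
    return row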

The result of the CollectionComposer flow is:

  • the output PCollection (in this case, I think, a PCollection[Dict])
  • a PCollection[Failure]

At the end you can write to multiple sinks:

  • write the good outputs to BigQuery
  • write the failures to BigQuery

You can also apply the same logic with native Beam error handling and TupleTags; I proposed an example in a project in my GitHub repository:

https://github.com/tosun-si/teams-league-python-dlq-native-beam-summit/blob/main/team_league/domain_ptransform/team_stats_transform.py
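
For reference, a minimal sketch of that native approach (the names below are illustrative) uses a DoFn with a tagged output for the failures:

import apache_beam as beam
from apache_beam import pvalue

FAILURES_TAG = 'failures'

class ValidateRowFn(beam.DoFn):
    def process(self, row):
        try:
            # Same kind of validation logic as validate_row above.
            if row.get('some_field_name') is None:
                raise ValueError('Bad field')
            yield row
        except Exception as e:
            # Route bad rows to the dead-letter output instead of failing the bundle.
            yield pvalue.TaggedOutput(FAILURES_TAG, {'input_element': str(row), 'error': str(e)})

results = pcoll | 'Validate' >> beam.ParDo(ValidateRowFn()).with_outputs(FAILURES_TAG, main='valid')
output_pcoll = results.valid
failure_pcoll = results[FAILURES_TAG]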

Mazlum Tosun
  • Thank you very much! I'm struggling to understand how "TupleTags" factors into this - I don't see any "TaggedOutput" or "TupleTags" in the code example? How does it work? – cozos Sep 07 '22 at 11:29
  • You are very welcome :) The `TupleTags` logic is encapsulated in the `Asgarde` library. For a native `Beam` example, you can check the link I shared at the end of my answer. – Mazlum Tosun Sep 07 '22 at 12:31

Let's step back slightly and look at the aims and desired outcomes.

Why is "FILE_LOADS" required as a bigquery write method?

https://beam.apache.org/releases/javadoc/2.29.0/org/apache/beam/sdk/io/gcp/bigquery/BigQueryIO.Write.Method.html

Are you also aware of the BigQuery Storage Write API? https://cloud.google.com/bigquery/docs/write-api

It looks like the Java SDK supports the BQ Write API, but the Python SDK currently does not. I believe using the Write API would connect over gRPC to write into BigQuery, rather than needing to serialize to Avro and then call the [legacy] batch load process.

Perhaps take a look and see if that helps. Schemas are important, but it seems Avro is irrelevant to your aims and is there just because of the code you are calling?

bruce
  • Hi, thanks for the reply. The reason behind `FILE_LOADS` is that I am running a batch Apache Beam pipeline, for which the default method is "FILE_LOADS": https://github.com/apache/beam/blob/master/sdks/python/apache_beam/io/gcp/bigquery.py#L2271-L2272. I am assuming that the default is chosen for performance reasons, as batch loading makes more sense for batch jobs (i.e. amortizing network IO). – cozos Sep 03 '22 at 05:43
  • As for why Avro - it is also for performance reasons. GCP documentation says that Avro is the preferred format for loading data: https://cloud.google.com/bigquery/docs/loading-data-cloud-storage-avro#advantages_of_avro as it is the fastest to load, is splittable when compressed, doesn't need serialization, etc. It seems like Apache Beam added Avro support to reduce the size of serialized data: https://issues.apache.org/jira/browse/BEAM-8841 ("JSON has some disadvantages including size of serialized data...") – cozos Sep 03 '22 at 05:49
  • So to summarize, Avro/FILE_LOADS don't affect any functional requirements but are the best-performing settings for me. Changing them could introduce a performance regression and increase costs. – cozos Sep 03 '22 at 05:55