
When writing simple records to a Postgres table (though it could be any database) at the end of a pipeline, some of the incoming records violate a uniqueness constraint and trigger an exception. As far as I can tell, there's no straightforward way to handle these gracefully: the pipeline either errors out completely or, depending on the runner, enters an endless fail-and-retry loop.

There doesn't appear to be any mention of error handling for this case in the Beam docs, and the Medium posts on error handling don't seem to apply to this particular type of PTransform, which returns PDone.

The one related answer I found isn't comprehensible and contains no examples.

In my example, I'm reading from a file containing two duplicate lines and trying to write them into a table.

CREATE TABLE foo (
    field CHARACTER VARYING(100) UNIQUE
);

foo.txt contains:

a
a

The pipeline looks like this:

Pipeline p = Pipeline.create();
p.apply(TextIO.read().from("/path/to/foo.txt"))
 .apply(JdbcIO.<String>write()
     .withDataSourceConfiguration(JdbcIO.DataSourceConfiguration.create(
         "org.postgresql.Driver", "jdbc:postgresql://localhost:5432/somedb"))
     .withStatement("INSERT INTO foo (field) VALUES (?)")
     .withPreparedStatementSetter(new JdbcIO.PreparedStatementSetter<String>() {
         private static final long serialVersionUID = 1L;
         public void setParameters(String element, PreparedStatement query) throws SQLException {
             query.setString(1, element);
         }
     }));
p.run();

Here is the output from the simple example above:

[WARNING]
org.apache.beam.sdk.Pipeline$PipelineExecutionException: java.sql.BatchUpdateException: Batch entry 0 INSERT INTO foo (field) VALUES ('a') was aborted: ERROR: duplicate key value violates unique constraint "foo_field_key"
  Detail: Key (field)=(a) already exists.  Call getNextException to see other errors in the batch.
    at org.apache.beam.runners.direct.DirectRunner$DirectPipelineResult.waitUntilFinish (DirectRunner.java:332)
    at org.apache.beam.runners.direct.DirectRunner$DirectPipelineResult.waitUntilFinish (DirectRunner.java:302)
    at org.apache.beam.runners.direct.DirectRunner.run (DirectRunner.java:197)
    at org.apache.beam.runners.direct.DirectRunner.run (DirectRunner.java:64)
    at org.apache.beam.sdk.Pipeline.run (Pipeline.java:313)
    at org.apache.beam.sdk.Pipeline.run (Pipeline.java:299)
    at com.thing.Main.main (Main.java:105)
    at sun.reflect.NativeMethodAccessorImpl.invoke0 (Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke (NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke (DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke (Method.java:498)
    at org.codehaus.mojo.exec.ExecJavaMojo$1.run (ExecJavaMojo.java:282)
    at java.lang.Thread.run (Thread.java:748)
Caused by: java.sql.BatchUpdateException: Batch entry 0 INSERT INTO foo (field) VALUES ('a') was aborted: ERROR: duplicate key value violates unique constraint "foo_field_key"
  Detail: Key (field)=(a) already exists.  Call getNextException to see other errors in the batch.
    at org.postgresql.jdbc.BatchResultHandler.handleError (BatchResultHandler.java:148)
    at org.postgresql.core.ResultHandlerDelegate.handleError (ResultHandlerDelegate.java:50)
    at org.postgresql.core.v3.QueryExecutorImpl.processResults (QueryExecutorImpl.java:2184)
    at org.postgresql.core.v3.QueryExecutorImpl.execute (QueryExecutorImpl.java:481)
    at org.postgresql.jdbc.PgStatement.executeBatch (PgStatement.java:840)
    at org.postgresql.jdbc.PgPreparedStatement.executeBatch (PgPreparedStatement.java:1538)
    at org.apache.commons.dbcp2.DelegatingStatement.executeBatch (DelegatingStatement.java:345)
    at org.apache.commons.dbcp2.DelegatingStatement.executeBatch (DelegatingStatement.java:345)
    at org.apache.commons.dbcp2.DelegatingStatement.executeBatch (DelegatingStatement.java:345)
    at org.apache.commons.dbcp2.DelegatingStatement.executeBatch (DelegatingStatement.java:345)
    at org.apache.beam.sdk.io.jdbc.JdbcIO$Write$WriteFn.executeBatch (JdbcIO.java:846)
    at org.apache.beam.sdk.io.jdbc.JdbcIO$Write$WriteFn.finishBundle (JdbcIO.java:819)
Caused by: org.postgresql.util.PSQLException: ERROR: duplicate key value violates unique constraint "foo_field_key"
  Detail: Key (field)=(a) already exists.
    at org.postgresql.core.v3.QueryExecutorImpl.receiveErrorResponse (QueryExecutorImpl.java:2440)
    at org.postgresql.core.v3.QueryExecutorImpl.processResults (QueryExecutorImpl.java:2183)
    at org.postgresql.core.v3.QueryExecutorImpl.execute (QueryExecutorImpl.java:481)
    at org.postgresql.jdbc.PgStatement.executeBatch (PgStatement.java:840)
    at org.postgresql.jdbc.PgPreparedStatement.executeBatch (PgPreparedStatement.java:1538)
    at org.apache.commons.dbcp2.DelegatingStatement.executeBatch (DelegatingStatement.java:345)
    at org.apache.commons.dbcp2.DelegatingStatement.executeBatch (DelegatingStatement.java:345)
    at org.apache.commons.dbcp2.DelegatingStatement.executeBatch (DelegatingStatement.java:345)
    at org.apache.commons.dbcp2.DelegatingStatement.executeBatch (DelegatingStatement.java:345)
    at org.apache.beam.sdk.io.jdbc.JdbcIO$Write$WriteFn.executeBatch (JdbcIO.java:846)
    at org.apache.beam.sdk.io.jdbc.JdbcIO$Write$WriteFn.finishBundle (JdbcIO.java:819)
    at org.apache.beam.sdk.io.jdbc.JdbcIO$Write$WriteFn$DoFnInvoker.invokeFinishBundle (Unknown Source)
    at org.apache.beam.repackaged.beam_runners_direct_java.runners.core.SimpleDoFnRunner.finishBundle (SimpleDoFnRunner.java:285)
    at org.apache.beam.repackaged.beam_runners_direct_java.runners.core.SimplePushbackSideInputDoFnRunner.finishBundle (SimplePushbackSideInputDoFnRunner.java:118)
    at org.apache.beam.runners.direct.ParDoEvaluator.finishBundle (ParDoEvaluator.java:223)
    at org.apache.beam.runners.direct.DoFnLifecycleManagerRemovingTransformEvaluator.finishBundle (DoFnLifecycleManagerRemovingTransformEvaluator.java:73)
    at org.apache.beam.runners.direct.DirectTransformExecutor.finishBundle (DirectTransformExecutor.java:188)
    at org.apache.beam.runners.direct.DirectTransformExecutor.run (DirectTransformExecutor.java:126)
    at java.util.concurrent.Executors$RunnableAdapter.call (Executors.java:511)
    at java.util.concurrent.FutureTask.run (FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker (ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run (ThreadPoolExecutor.java:624)
    at java.lang.Thread.run (Thread.java:748)

I'd like to be able to catch that exception and divert the offending records to some dead-letter construct.

2 Answers


There is no general way of doing this in Beam yet. There are discussions from time to time about modifying the IOs to return something other than PDone, but to my knowledge nothing is readily available.

At the moment I can think of a couple of workarounds, all of them far from ideal:

  • handle restarting the pipeline in the driver program when it fails;
  • copy-paste JdbcIO (or parts of it), or implement your own Jdbc ParDo with custom exception handling;
  • add an exception-handling feature to JdbcIO and contribute it back to Beam; it would be appreciated.
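
The second workaround can be sketched roughly as follows. This is a minimal, illustrative sketch only (not JdbcIO itself): it performs per-record inserts instead of JdbcIO's batching, the tag names and connection handling are assumptions, and real code would want connection pooling and retries for transient errors:

```java
// Hypothetical dead-letter sketch: a DoFn that inserts one record at a time
// and routes records that fail with SQLException to a side output.
TupleTag<String> successTag = new TupleTag<String>() {};
TupleTag<String> deadLetterTag = new TupleTag<String>() {};

PCollectionTuple results = lines.apply(
    ParDo.of(new DoFn<String, String>() {
        private transient Connection connection;

        @Setup
        public void setup() throws SQLException {
            // Simplified: one connection per DoFn instance.
            connection = DriverManager.getConnection("jdbc:postgresql://localhost:5432/somedb");
        }

        @ProcessElement
        public void processElement(ProcessContext c) {
            try (PreparedStatement stmt =
                     connection.prepareStatement("INSERT INTO foo (field) VALUES (?)")) {
                stmt.setString(1, c.element());
                stmt.executeUpdate();
                c.output(c.element());
            } catch (SQLException e) {
                // Instead of failing the bundle, divert the record to the dead-letter output.
                c.output(deadLetterTag, c.element());
            }
        }

        @Teardown
        public void teardown() throws SQLException {
            if (connection != null) connection.close();
        }
    }).withOutputTags(successTag, TupleTagList.of(deadLetterTag)));

PCollection<String> deadLetters = results.get(deadLetterTag);
```

The trade-off is that per-record execution loses the batching JdbcIO does in finishBundle, so throughput will be lower.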
Anton

I was facing the same issue, so I created a custom JdbcIO write that returns a PCollectionTuple instead of PDone, classifying the successfully inserted records separately from the records that threw a SQLException during executeBatch in WriteFn.

Below is the link for more details: https://sachin4java.blogspot.com/2021/11/extract-error-records-while-inserting.html
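
As a hedged illustration of the pattern described (not the blog post's actual code), a caller of such a custom transform might consume the PCollectionTuple like this; `CustomJdbcWrite` and the tag names are hypothetical:

```java
// Hypothetical tags exposed by a custom write transform that returns
// PCollectionTuple instead of PDone; names are illustrative only.
TupleTag<String> insertedTag = new TupleTag<String>() {};
TupleTag<String> failedTag = new TupleTag<String>() {};

PCollectionTuple outcome = input.apply(new CustomJdbcWrite(insertedTag, failedTag));

// Records that failed insertion can be sent to a dead-letter sink,
// e.g. written to text files for later inspection or replay.
outcome.get(failedTag)
       .apply(TextIO.write().to("/path/to/dead-letters"));
```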