I'm trying to add rows to BigQuery from my Apache Beam pipeline using a BigQuery load job. The initial data comes from a PostgreSQL database and is read into Beam with the JdbcIO source:

@Override
public PCollection<Intent> expand(PBegin input) {
    return input.apply(JdbcIO.<Intent>read()
                             .withDataSourceConfiguration(JdbcIO.DataSourceConfiguration
                                 .create(DASHBOARD_DB_DRIVER, getJdbcConnectionString()))
                             .withQuery(QUERY)
                             .withCoder(ProtoCoder.of(Intent.class))
                             .withRowMapper(new IntentParser()));
}
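
For completeness, IntentParser is a plain JdbcIO.RowMapper that builds the Intent proto from each ResultSet row, roughly like this (the column names are simplified placeholders, not my real schema):

public class IntentParser implements JdbcIO.RowMapper<Intent> {
    @Override
    public Intent mapRow(ResultSet resultSet) throws Exception {
        // Column names are placeholders, not the real schema.
        return Intent.newBuilder()
                     .setId(resultSet.getLong("id"))
                     .setName(resultSet.getString("name"))
                     .build();
    }
}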

The BigQuery writer:

@Override
public WriteResult expand(PCollection<KV<StateTable, StateRow>> events) {
    Write<KV<StateTable, StateRow>> writer = BigQueryIO.<KV<StateTable, StateRow>>write()
                                                       .withCreateDisposition(CreateDisposition.CREATE_NEVER)
                                                       .withWriteDisposition(WriteDisposition.WRITE_APPEND)
                                                       .withFormatFunction(new StateRowMapper())
                                                       .to(x -> x.getValue().getKey().getDestination());

    return events.apply(writer);
}
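
StateRowMapper is the format function that turns each element into a BigQuery TableRow, roughly like this (field names and accessors are simplified placeholders):

public class StateRowMapper implements SerializableFunction<KV<StateTable, StateRow>, TableRow> {
    @Override
    public TableRow apply(KV<StateTable, StateRow> input) {
        // Field names and accessors are placeholders, not the real schema.
        StateRow row = input.getValue();
        return new TableRow()
                .set("state", row.getState())
                .set("updated_at", row.getUpdatedAt());
    }
}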

When I execute the pipeline I get thousands of log entries like the following (one for each row, I guess), and the process takes ages (more than 10 minutes):

Mar 14, 2018 12:13:46 PM org.apache.beam.sdk.io.gcp.bigquery.TableRowWriter <init>
INFO: Opening TableRowWriter to gs://piesync-analytics-temp-storage/BigQueryWriteTemp/16b8edc635ac4dfcaf2494b0217be8f8/15a00ba8-70ff-43ad-99f7-0090b9465d0b.

I checked the temporary files in Google Cloud Storage (required for the BigQuery load job) and noticed that each file contains the data for only one row.

This worries me because when I do exactly the same thing with the same data, but read from a file (using TextIO) instead of a database, I get only a few dozen log entries and the temporary files contain thousands of BigQuery records. In that case the process finishes in less than a minute.
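
For comparison, the file-based variant reads like this (the path and parse function here are placeholders for what I actually use):

PCollection<Intent> intents = input
        .apply(TextIO.read().from("gs://some-bucket/intents/*.csv")) // placeholder path
        .apply(ParDo.of(new ParseIntentFn()));                       // hypothetical DoFn<String, Intent>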

I didn't specify any windowing or triggering as I'm just trying to read a source once.

I guess I have to enable some kind of batching, but I have no idea what to enable or where.
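
For example, would something along these lines help, using Beam's GroupIntoBatches and then flattening the batches back out? The batch size of 500 is arbitrary, and I'm not sure this is even the right place to batch:

PCollection<KV<StateTable, StateRow>> batched = events
        // Collect rows into groups of 500 per table (arbitrary size).
        // StateTable would need a deterministic coder to be usable as a key.
        .apply(GroupIntoBatches.<StateTable, StateRow>ofSize(500))
        // Flatten each batch back into individual elements for BigQueryIO.
        .apply(ParDo.of(new DoFn<KV<StateTable, Iterable<StateRow>>, KV<StateTable, StateRow>>() {
            @ProcessElement
            public void processElement(ProcessContext c) {
                for (StateRow row : c.element().getValue()) {
                    c.output(KV.of(c.element().getKey(), row));
                }
            }
        }));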

wardva
  • How is StateTable.getDestination() implemented? How many different BigQuery tables are you writing to, in the end? What is JdbcPollIO? (your code uses regular JdbcIO) How long does it take the database to execute your query in general, without Dataflow? Can you include a Dataflow job id? – jkff Mar 18 '18 at 04:31
  • @jkff I meant `JdbcIO` instead of `JdbcPollIO`. I'm writing to 3 different tables. And I can't give you a jobId because I'm running my Beam pipeline locally with direct runner. The query I run with JdbcIO runs almost instantly on my local postgresql database. – wardva Mar 26 '18 at 12:20
