I'm trying to add rows to BigQuery from my Apache Beam pipeline using a BigQuery load job. The initial data I'm processing comes from a PostgreSQL database and is read into Beam with the JdbcIO source:
@Override
public PCollection<Intent> expand(PBegin input) {
  return input.apply(JdbcIO.<Intent>read()
      .withDataSourceConfiguration(JdbcIO.DataSourceConfiguration
          .create(DASHBOARD_DB_DRIVER, getJdbcConnectionString()))
      .withQuery(QUERY)
      .withCoder(ProtoCoder.of(Intent.class))
      .withRowMapper(new IntentParser()));
}
The BigQuery writer:
@Override
public WriteResult expand(PCollection<KV<StateTable, StateRow>> events) {
  Write<KV<StateTable, StateRow>> writer = BigQueryIO.<KV<StateTable, StateRow>>write()
      .withCreateDisposition(CreateDisposition.CREATE_NEVER)
      .withWriteDisposition(WriteDisposition.WRITE_APPEND)
      .withFormatFunction(new StateRowMapper())
      .to(x -> x.getValue().getKey().getDestination());
  return events.apply(writer);
}
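For completeness, the two transforms are wired up roughly like this (ReadIntentsFromJdbc, IntentToStateRowFn and WriteStateRows are simplified stand-ins for my own classes, which contain the snippets above):
Pipeline pipeline = Pipeline.create(options);
pipeline
    // Read the Intent rows from PostgreSQL with the JdbcIO transform above.
    .apply("ReadIntents", new ReadIntentsFromJdbc())
    // Map each Intent to its destination table plus the row to write
    // (a DoFn<Intent, KV<StateTable, StateRow>>).
    .apply("ToStateRows", ParDo.of(new IntentToStateRowFn()))
    // Hand the keyed rows to the BigQuery load-job writer above.
    .apply("WriteToBigQuery", new WriteStateRows());
pipeline.run().waitUntilFinish();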
When I execute the pipeline, I get thousands of log entries like the following (one for each row, I guess), and the process takes ages (more than 10 minutes):
Mar 14, 2018 12:13:46 PM org.apache.beam.sdk.io.gcp.bigquery.TableRowWriter <init>
INFO: Opening TableRowWriter to gs://piesync-analytics-temp-storage/BigQueryWriteTemp/16b8edc635ac4dfcaf2494b0217be8f8/15a00ba8-70ff-43ad-99f7-0090b9465d0b.
I checked the temporary files in Google Cloud Storage (required for the BigQuery load job) and noticed that each file contains data for only one row.
I'm worried about this because when I do exactly the same thing, with the same data, but read from a file instead of the database (using TextIO), I only get a few dozen log entries and the temporary files contain thousands of BigQuery records each. In that case the process finishes in less than a minute.
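For reference, the file-based variant only swaps the JdbcIO read for something like this (ParseIntentFn and getInputFilePattern are simplified stand-ins for my own code); the rest of the pipeline is identical:
@Override
public PCollection<Intent> expand(PBegin input) {
  return input
      // Read the same data exported to a text file, one record per line.
      .apply(TextIO.read().from(getInputFilePattern()))
      // Parse each line into the same Intent proto the JdbcIO version produces.
      .apply(ParDo.of(new ParseIntentFn()))
      .setCoder(ProtoCoder.of(Intent.class));
}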
I didn't specify any windowing or triggering as I'm just trying to read a source once.
I guess I have to enable some kind of batching, but I have no idea what to enable or where.
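The closest thing I could come up with was forcing a shuffle between the mapping step and the writer, hoping the writer would then see larger bundles, but I have no idea whether that actually influences how the temporary files are batched:
pipeline
    .apply("ReadIntents", new ReadIntentsFromJdbc())
    .apply("ToStateRows", ParDo.of(new IntentToStateRowFn()))
    // My guess, unconfirmed: break the chain here so the write step gets
    // bigger bundles instead of one element at a time.
    .apply("MaybeBatch", Reshuffle.<KV<StateTable, StateRow>>viaRandomKey())
    .apply("WriteToBigQuery", new WriteStateRows());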