
I need to save the elements of an SCollection into different hourly BigQuery tables based on their timestamps. I tried the following approaches:

  1. Group the elements into (TableName, Iterable[TableRow]) pairs and then write each Iterable[TableRow] to its respective table using a BigQueryClient instance (see the sketch after this list). This does not work because BigQueryClient is not serializable.

  2. Create an SCollection[(TableName, PCollection[TableRow])] and then save each PCollection[TableRow] to its respective BigQuery table using BigQueryIO.Write. To create the PCollection[TableRow] objects I use .map(s => (s._1, sc.pipeline.apply(Create.of(s._2.toList.asJava)))), where sc is an instance of ScioContext. This does not work because ScioContext is not serializable.
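
For concreteness, attempt 1 looked roughly like the sketch below; HourlyBigQueryClient and tableNameFor are hypothetical stand-ins for my actual code, not real Scio APIs:

import com.google.api.services.bigquery.model.TableRow

// Hypothetical stand-in for the real (non-serializable) BigQuery client
trait HourlyBigQueryClient {
  def insertRows(tableName: String, rows: List[TableRow]): Unit
}

val client: HourlyBigQueryClient = ???        // created on the driver
def tableNameFor(row: TableRow): String = ??? // e.g. suffix by event hour

tblRows                                       // SCollection[TableRow]
  .map(row => (tableNameFor(row), row))
  .groupByKey
  .map { case (tableName, rows) =>
    // This closure captures `client`; Dataflow must serialize it to ship it
    // to the workers, which is where serialization fails
    client.insertRows(tableName, rows.toList)
  }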

Is there a way to stream insert elements into different BigQuery tables?

Kakaji

2 Answers


To do this with Scio, you can create a custom output transform that writes to destinations specified by a DynamicDestinations object (Apache Beam). The destination table is determined dynamically from a characteristic of each input element, in this case the hour of its event time.

Custom output transform for BigQuery:

import com.google.api.services.bigquery.model.{TableRow, TableSchema}
import com.spotify.scio.bigquery.BigQueryUtil
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO.Write.WriteDisposition
import org.apache.beam.sdk.io.gcp.bigquery._
import org.apache.beam.sdk.transforms.PTransform
import org.apache.beam.sdk.values.{PCollection, PDone, ValueInSingleWindow}
import org.joda.time.format.DateTimeFormat


def saveAsBigQuery(tblPrefix: String,
                   tblSchema: String,
                   writeDisposition: WriteDisposition):
  PTransform[PCollection[TableRow], PDone] = {

  BigQueryIO.writeTableRows()
    .to(new DynamicDestinations[TableRow, String] {

      override def getTable(tblSuffix: String): TableDestination = {
        // Construct the full table name from the prefix and the hourly suffix
        val tblName = "%s_%s".format(tblPrefix, tblSuffix)
        new TableDestination(tblName, null)
      }

      override def getDestination(tblRow: ValueInSingleWindow[TableRow]): String = {
        // Determine the hourly table suffix from the element's event time.
        // One possible implementation, assuming UTC hourly tables:
        tblRow.getTimestamp.toString(
          DateTimeFormat.forPattern("yyyyMMddHH").withZoneUTC())
      }

      override def getSchema(destination: String): TableSchema = {
        BigQueryUtil.parseSchema(tblSchema)
      }
    })
    .withWriteDisposition(writeDisposition)
    // writeTableRows() is typed to produce a WriteResult; the cast (safe at
    // runtime due to type erasure) matches the PDone signature declared above
    .asInstanceOf[PTransform[PCollection[TableRow], PDone]]
}

Apply the custom output transform using the function above:

import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO.Write


val tblPrefix = "table_prefix"
val tblSchema = "table_schema"  // TODO: this needs to be in valid BigQuery schema format
val writeDisposition = Write.WriteDisposition.WRITE_APPEND

val bqTransform = saveAsBigQuery(
  tblPrefix,
  tblSchema,
  writeDisposition)

// assuming tblRows is an SCollection[TableRow]
tblRows.saveAsCustomOutput("saveAsBigQuery", bqTransform)
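
Note: with the hourly suffix format sketched in getDestination above, each row is appended to a table named like table_prefix_2017080112, one table per event-time hour; under Beam's default CREATE_IF_NEEDED create disposition, missing tables are created using tblSchema.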
Andrew Nguonly

Within Beam, the BigQuery IO transform provides several methods for selecting the table based on the current window. I believe that Dataflow 1.9 had similar methods for window-dependent destinations.
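
For reference, one of those methods in Beam 2.x is the SerializableFunction overload of to(), which picks a TableDestination per element. A minimal sketch in Scala (the project, dataset, and hourly suffix format are illustrative assumptions):

import com.google.api.services.bigquery.model.TableRow
import org.apache.beam.sdk.io.gcp.bigquery.{BigQueryIO, TableDestination}
import org.apache.beam.sdk.transforms.SerializableFunction
import org.apache.beam.sdk.values.ValueInSingleWindow
import org.joda.time.format.DateTimeFormat

val write = BigQueryIO.writeTableRows()
  .to(new SerializableFunction[ValueInSingleWindow[TableRow], TableDestination] {
    override def apply(row: ValueInSingleWindow[TableRow]): TableDestination = {
      // Derive an hourly suffix from the element's event time (UTC assumed)
      val hour = row.getTimestamp.toString(
        DateTimeFormat.forPattern("yyyyMMddHH").withZoneUTC())
      new TableDestination(s"my-project:my_dataset.events_$hour", null)
    }
  })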

Dataflow 2.0 also includes DynamicDestinations. See the Javadoc for an example that selects the table based on a user ID within each element.

I'm not familiar with Scio, but it seems like exposing the underlying methods from BigQuery IO would be the easiest way to accomplish this.

Ben Chambers