
I have read some documents with textFile and done a flatMap over the individual words, adding some extra information to each word:

val col = sc.textFile(args.getOrElse("input","documents/*"))
    .flatMap(_.split("\\s+").filter(_.nonEmpty))
val mapped = col.map(t => t + ": " + extraInformation())

Saving this as plain text is easy:

mapped.saveAsTextFile(args.getOrElse("output", "results"))

But I cannot figure out how to save this map to a BigQuery table. All the examples I have seen create the initial SCollection from BigQuery and then save it to another table, so the initial collection is [TableRow] instead of [String].

What is the correct approach here? Should I investigate how to convert my data into a kind of collection BigQuery will accept? Or should I try to push this plain text straight into a table?

telex-wap

2 Answers


I would suggest using the @BigQueryType.toTable annotation on a case class, like so:

import com.spotify.scio.bigquery._

object MyScioJob {

  @BigQueryType.toTable
  case class WordAnnotated(word: String, extraInformation: String)


  def main(args: Array[String]): Unit = {
    // ...job setup logic

    sc.textFile(args.getOrElse("input","documents/*"))
      .flatMap(_.split("\\s+").filter(_.nonEmpty))
      .map(t => WordAnnotated(t, extraInformation()))
      .saveAsTypedBigQuery("myProject:myDataset.myTable")
  }
}

There's more information about this on the Scio wiki.
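If it helps to see what the annotation is doing for you: the macro derives the table schema and a row conversion from the case class fields. Here is a plain-Scala sketch of that conversion with no Scio dependency (the field names mirror the case class; the Map stands in for a BigQuery TableRow, and the "extra" value is illustrative):

```scala
// Illustrative stand-in for the conversion @BigQueryType.toTable derives:
// one case class instance becomes one row, keyed by field name.
case class WordAnnotated(word: String, extraInformation: String)

def toRow(w: WordAnnotated): Map[String, String] =
  Map("word" -> w.word, "extraInformation" -> w.extraInformation)

val words = List("foo", "bar")
val rows = words.map(w => toRow(WordAnnotated(w, "extra")))
// each element of rows is one row: Map("word" -> ..., "extraInformation" -> "extra")
```

With the annotation in place you never write this by hand; `saveAsTypedBigQuery` handles both the schema and the row conversion.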


In order to write to BigQuery, you need to define a TableSchema:

import java.util.Collections;
import java.util.List;

import com.google.api.services.bigquery.model.TableFieldSchema;
import com.google.api.services.bigquery.model.TableSchema;

public static final TableSchema BQ_TABLE_SCHEMA = new TableSchema();
public static final List<TableFieldSchema> BQ_FIELDS;

static {
    // A single nullable STRING column to hold each word
    TableFieldSchema stringField = new TableFieldSchema()
            .setName("string_field")
            .setType("STRING")
            .setMode("NULLABLE");

    BQ_FIELDS = Collections.singletonList(stringField);

    BQ_TABLE_SCHEMA.setFields(BQ_FIELDS);
}

You then need to transform each String into a TableRow object:

.apply("ConvertToTableRow", ParDo.of(new DoFn<String, TableRow>() {
    @ProcessElement
    public void processElement(ProcessContext c) {
        // Wrap each input string in a single-column TableRow
        c.output(new TableRow().set("string_field", c.element()));
    }
}))
.apply("InsertTableRowsToBigQuery",
    BigQueryIO.writeTableRows().to("project_id:dataset_name.table_name")
        .withSchema(BQ_TABLE_SCHEMA)
        .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_IF_NEEDED)
        .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND))
.getFailedInserts();

You can also take a look at this Java example, which is very similar to what needs to be done in Scio: https://github.com/apache/beam/blob/master/examples/java/src/main/java/org/apache/beam/examples/complete/StreamingWordExtract.java#L78
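In Scio the ParDo above collapses to a plain map over the SCollection, so the only logic you write yourself is the per-element conversion. A dependency-free sketch of that step ("string_field" taken from the schema above; the Map stands in for a TableRow):

```scala
// Dependency-free model of the DoFn body above: each input string
// becomes one row with a single "string_field" column.
def toStringFieldRow(s: String): Map[String, String] =
  Map("string_field" -> s)

val lines = List("hello", "world")
val tableRows = lines.map(toStringFieldRow)
// In Scio, this .map would run on the SCollection and be followed by a
// BigQuery save of the resulting rows using the schema defined above.
```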

Juta