
I have a Dataflow pipeline that reads data from files in GCS, transforms it, and writes the results to BQ. I created tests that check that the expected TableRows are in the PCollection, but how can I verify that the data that will be written into BQ is valid according to the table schema I provided?

    tableRowPCollection.apply(
        BigQueryIO.Write.named("Write to table")
            .to(options.getTableName())
            .withSchema(someSchema)
            .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_IF_NEEDED)
            .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_TRUNCATE));

1 Answer


I am assuming that you want to verify that BigQueryIO itself does not corrupt the data in your PCollection?

I don't think there is a good way to do that, short of actually running the pipeline, reading the data back, and verifying it (you can use BigQueryIO.Read to read the data back and a PAssert to verify it), but I was assuming you're looking for something more lightweight.
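For completeness, here is a minimal sketch of that read-back check, written against the same Dataflow 1.x SDK style as your snippet (there the assertion class is called DataflowAssert; in Apache Beam it became PAssert). The tableName and expectedRows values are hypothetical test fixtures, not anything from your pipeline:

    import com.google.api.services.bigquery.model.TableRow;
    import com.google.cloud.dataflow.sdk.Pipeline;
    import com.google.cloud.dataflow.sdk.io.BigQueryIO;
    import com.google.cloud.dataflow.sdk.testing.DataflowAssert;
    import com.google.cloud.dataflow.sdk.testing.TestPipeline;
    import com.google.cloud.dataflow.sdk.values.PCollection;
    import java.util.Arrays;
    import java.util.List;

    // Hypothetical fixtures: the table your pipeline wrote to,
    // and the rows your test expects to find in it.
    String tableName = "my-project:my_dataset.my_table";
    List<TableRow> expectedRows = Arrays.asList(
        new TableRow().set("name", "alice").set("score", 1));

    // Read the table back and assert on its contents.
    Pipeline p = TestPipeline.create();
    PCollection<TableRow> readBack =
        p.apply(BigQueryIO.Read.named("ReadBack").from(tableName));
    DataflowAssert.that(readBack).containsInAnyOrder(expectedRows);
    p.run();

Note that this requires the table to actually exist, so it only works as a post-run verification step, not as a unit test.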

You can also take a look at how BigQueryIO itself is tested, e.g. this test. The important method is withTestServices() (both BigQueryIO.Read and BigQueryIO.Write have it), but note that it is a package-local implementation detail and is not intended to be used by pipeline authors.

– jkff