I'm building a pipeline to back up data from PubSub into GCS. I wanted to write a test for it using JobTest, but I'm struggling to get PubsubIO to pick up the event time correctly.

PubSub is read using sc.pubsubSubscriptionWithAttributes[String]("path/to/subscription", timestampAttribute = "doc_timestamp"). After that I apply windowing and write the result to a CustomIO.

The test looks like this:

JobTest[PubSub2GCS.type]
  .args("--subscription=input", "--targetDir=output")
  .input(PubsubIO[(String, Map[String, String])]("input"), Seq(("Contents", Map[String, String]("doc_timestamp" -> "2001-01-01T09:10:11.332Z"))))
  .output(CustomIO[KV[String, WindowedDoc]]("output"))(_.debug())
  .run()

The result is that the value is placed in the -290308-12-21T20:00:00.000Z..-290308-12-21T21:00:00.000Z window, possibly because the date in "doc_timestamp" is not interpreted properly. In fact, the window never changes, regardless of the value under the "doc_timestamp" key.

Luckily the job works fine when running in production, but I'd like to have these tests written.

Carlos

1 Answer

This is because Map[String, String] attributes in ScioContext#pubsubSubscriptionWithAttributes are not populated in JobTest.
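The window in the question sits at the very bottom of Beam's timestamp range, which is what you would expect if the test elements never get an event time assigned. Here is a minimal sketch of that arithmetic using only java.time. It assumes Beam's minimum timestamp is Long.MinValue microseconds truncated to milliseconds, that the job uses hourly fixed windows, and that the window start is computed with the JVM's truncating remainder. MinTimestampWindow and its members are names made up for illustration.

```scala
import java.time.Instant

object MinTimestampWindow {
  // Assumption: Beam's minimum element timestamp is Long.MinValue
  // microseconds, truncated to whole milliseconds.
  val minTimestampMillis: Long = Long.MinValue / 1000

  // Assumption: the job uses one-hour fixed windows.
  val hourMillis: Long = 60L * 60L * 1000L

  // Window start computed with the JVM's truncating remainder. For a
  // negative timestamp the remainder is negative (or zero), so the start
  // is the nearest window boundary at or after the timestamp.
  def windowStartMillis(timestampMillis: Long, sizeMillis: Long): Long =
    timestampMillis - (timestampMillis % sizeMillis)

  def main(args: Array[String]): Unit = {
    val start = windowStartMillis(minTimestampMillis, hourMillis)
    println(s"element timestamp: ${Instant.ofEpochMilli(minTimestampMillis)}")
    println(s"window: ${Instant.ofEpochMilli(start)}..${Instant.ofEpochMilli(start + hourMillis)}")
  }
}
```

If the printed window matches the one from your test output, the elements are carrying the default minimum timestamp rather than a badly parsed "doc_timestamp".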

We can probably add a condition here that sets the timestamp if ScioContext#isTest and timestampAttribute != null: https://github.com/spotify/scio/blob/master/scio-core/src/main/scala/com/spotify/scio/ScioContext.scala#L572
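As a rough sketch of what such a condition would need to do with each test element, the helper below looks up the timestamp attribute and turns it into an instant. It is not Scio code: TimestampAttribute.parse is a hypothetical name, it uses java.time rather than the Joda types the pipeline works with, and it assumes the attribute holds either milliseconds since the Unix epoch or an RFC 3339 string, the two formats Beam's PubsubIO documents for timestamp attributes.

```scala
import java.time.Instant
import scala.util.Try

// Hypothetical helper for illustration only, not part of Scio.
object TimestampAttribute {
  // Returns the event time stored under `key`, or None when the attribute
  // is missing or cannot be parsed.
  def parse(attributes: Map[String, String], key: String): Option[Instant] =
    attributes.get(key).flatMap { raw =>
      // First try a numeric string holding milliseconds since the epoch,
      // then fall back to an RFC 3339 / ISO-8601 instant such as
      // "2001-01-01T09:10:11.332Z". On older JVMs Instant.parse only
      // accepts the "Z" zone designator.
      Try(Instant.ofEpochMilli(raw.toLong))
        .orElse(Try(Instant.parse(raw)))
        .toOption
    }
}
```

In the test path, the parsed value would then be attached to the element as its event time (for example with timestampBy) before any windowing is applied; where exactly that hook belongs in ScioContext is left open here.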

Seems like a trivial fix. Can you please file an issue here and maybe submit a PR?

Neville Li