We are building an integration test for an Apache Beam pipeline and are running into some issues. See below for context...
Details about our pipeline:
- We use PubsubIO as our data source (unbounded PCollection)
- Intermediate transforms include a custom CombineFn and a very simple windowing/triggering strategy
- Our final transform is JdbcIO, using org.neo4j.jdbc.Driver to write to Neo4j
Current testing approach:
- Run Google Cloud's Pub/Sub emulator on the machine that the tests are running on
- Build an in-memory Neo4j database and pass its URI into our pipeline options
- Run the pipeline by calling OurPipeline.main(TestPipeline.convertToArgs(options))
- Use Google Cloud's Java Pub/Sub client library to publish messages to a test topic (on the Pub/Sub emulator), which PubsubIO will read from
- Data should flow through the pipeline and eventually hit our in-memory instance of Neo4j
- Make simple assertions regarding the presence of this data in Neo4j
This is intended to be a simple integration test which will verify that our pipeline as a whole is behaving as expected.
The issue we're currently having is that when we run our pipeline it is blocking. We are using DirectRunner and pipeline.run() (not pipeline.run().waitUntilFinish()), but the test seems to hang after running the pipeline. Because this is an unbounded PCollection (running in streaming mode), the pipeline does not terminate, and thus any code after it is not reached.
So, I have a few questions:
1) Is there a way to run a pipeline and then stop it manually later?
2) Is there a way to run a pipeline asynchronously? Ideally it would just kick off the pipeline (which would then continuously poll Pub/Sub for data) and then move on to the code responsible for publishing to Pub/Sub.
3) Is this method of integration testing a pipeline reasonable, or are there better methods that might be more straightforward? Any info/guidance here would be appreciated.
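To make questions 1 and 2 concrete, here is a minimal plain-Java sketch of the shape we have in mind. It does not use Beam at all: blockingPipelineMain is a hypothetical stand-in for OurPipeline.main, and the class and method names are made up for illustration. It also assumes the blocking call reacts to thread interruption, which we have not confirmed for DirectRunner.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

public class AsyncPipelineSketch {

    // Hypothetical stand-in for OurPipeline.main(args): it blocks until the
    // thread is interrupted, the way our streaming pipeline blocks the test today.
    static void blockingPipelineMain(CountDownLatch started) throws InterruptedException {
        started.countDown();
        new CountDownLatch(1).await(); // never counted down; returns only via interrupt
    }

    // Starts the blocking call on a background thread, runs the test body on the
    // calling thread, then cancels the background task. Returns true if the
    // background thread terminated within the timeout.
    static boolean runTestAroundPipeline(Runnable testBody) throws InterruptedException {
        ExecutorService executor = Executors.newSingleThreadExecutor();
        CountDownLatch started = new CountDownLatch(1);
        Future<?> pipeline = executor.submit(() -> {
            try {
                blockingPipelineMain(started);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // expected when the task is cancelled
            }
        });
        try {
            if (!started.await(5, TimeUnit.SECONDS)) {
                throw new IllegalStateException("pipeline thread did not start");
            }
            testBody.run(); // in the real test: publish to Pub/Sub, then assert on Neo4j
        } finally {
            pipeline.cancel(true); // interrupts the background thread
            executor.shutdownNow();
        }
        return executor.awaitTermination(5, TimeUnit.SECONDS);
    }

    public static void main(String[] args) throws InterruptedException {
        boolean stopped = runTestAroundPipeline(
                () -> System.out.println("test body ran while the stand-in was blocking"));
        System.out.println("background thread stopped: " + stopped);
    }
}
```

With the stand-in this works because the blocked thread responds to the interrupt. Whether the same approach is safe or sufficient with a real DirectRunner pipeline is exactly what we are unsure about.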
Let me know if I can provide any additional code/context - thanks!