Apache Beam: Reading in PCollection as PBegin for a pipeline

Question

I'm debugging a Beam pipeline, and my end goal is to write all of the strings in a PCollection to a text file.

I've set a breakpoint just after the PCollection I want to inspect is created, and I've been trying to create a new Pipeline that:

1. Reads in this output PCollection as the initial input
2. Writes it to a file (using `TextIO.write().to("/Users/my/local/fp")`)

I'm struggling with #1: how to read in the PCollection as the initial input.

The skeleton of what I've been trying:

Pipeline p2 = Pipeline.create();
p2.apply(/* READ IN THE PCOLLECTION HERE */)
  .apply(TextIO.write().to("/Users/my/local/fp"));
p2.run();

Any thoughts or suggestions would be appreciated.

score 0 · Accepted Answer · answered Mar 18 '20 at 17:23

To use a PCollection as the initial input of a pipeline, you need to read it from a source, for example data stored in BigQuery or Google Cloud Storage. There are specific source transforms for reading from each of these locations. Depending on where your data is stored, you will need to use the matching source transform and pass in the relevant parameters (for example, the GCS path or the BigQuery table).

Please take a look at the Minimal Word Count example on the Apache Beam website (full source on GitHub). I suggest starting from that code and iterating on it until you have the pipeline you need.

In this example, files are read from GCS:

p.apply(TextIO.read().from("gs://apache-beam-samples/shakespeare/*"))

Please also see this guide on using IOs and this list of Beam I/O transforms. If you just want a basic example working, you can use Create.of to build a PCollection from in-memory values in your program.
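As a rough illustration of the Create.of route, here is a minimal sketch of a second pipeline that starts from an in-memory list of strings and writes them out with TextIO. The class name `DumpStringsToText`, the helper `dump`, the sample strings, and the `/tmp/beam-debug/output` prefix are all made up for this example. It also assumes a runner such as the direct runner is on the classpath, since `Pipeline.create()` is called without explicit options.

```java
import java.util.Arrays;
import java.util.List;

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.coders.StringUtf8Coder;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.transforms.Create;

// Hypothetical helper class, not part of the original pipeline.
public class DumpStringsToText {

  // Builds and runs a small pipeline that writes the given strings to a text file.
  static void dump(List<String> lines, String outputPrefix) {
    Pipeline p2 = Pipeline.create();

    // Create.of turns in-memory values into the pipeline's initial PCollection.
    // The coder is set explicitly so that an empty list would also work.
    p2.apply("InMemoryInput", Create.of(lines).withCoder(StringUtf8Coder.of()))
        // withoutSharding() asks for a single output file instead of numbered shards.
        .apply("WriteText",
            TextIO.write().to(outputPrefix).withSuffix(".txt").withoutSharding());

    // Block until the pipeline has finished so the file is complete on return.
    p2.run().waitUntilFinish();
  }

  public static void main(String[] args) {
    // Assumed sample data and output location, for illustration only.
    dump(Arrays.asList("alpha", "beta", "gamma"), "/tmp/beam-debug/output");
  }
}
```

Note that this only helps if the strings are available as ordinary Java values. The order of lines in the output file is not guaranteed, because a PCollection is unordered.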
