I created a streaming Apache Beam pipeline that reads files from GCS folders and inserts them into BigQuery. It works perfectly, but it re-processes all the files when I stop and restart the job, so all the data gets duplicated.
My idea is to move files out of the scanned directory into another one once they have been processed, but I don't know how to do that technically with Apache Beam (see the sketch after the code below for what I have in mind).
Thank you
public static PipelineResult run(Options options) {
  // Create the pipeline.
  Pipeline pipeline = Pipeline.create(options);

  /*
   * Steps:
   *  1) Read from the text source, watching for new files.
   *  2) Write each text record to Pub/Sub.
   */
  LOG.info("Running pipeline");
  LOG.info("Input : " + options.getInputFilePattern());
  LOG.info("Output : " + options.getOutputTopic());

  PCollection<String> collection = pipeline
      .apply("Read Text Data", TextIO.read()
          .from(options.getInputFilePattern())
          .watchForNewFiles(Duration.standardSeconds(60), Watch.Growth.<String>never()))
      .apply("Write logs", ParDo.of(new DoFn<String, String>() {
        @ProcessElement
        public void processElement(ProcessContext c) throws Exception {
          LOG.info(c.element());
          c.output(c.element());
        }
      }));

  collection.apply("Write to PubSub", PubsubIO.writeStrings().to(options.getOutputTopic()));

  return pipeline.run();
}
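
What I have in mind is something like the following untested sketch: replace TextIO.read().watchForNewFiles(...) with FileIO.match().continuously(...) so that the file names are available alongside the contents, read each matched file, and then rename it into a processed/ prefix with FileSystems.rename so a restarted job no longer matches it. The destination path gs://my-bucket/processed/ is just a placeholder for my real bucket. Is this the right approach?

import java.util.Collections;
import org.apache.beam.sdk.io.FileIO;
import org.apache.beam.sdk.io.FileSystems;
import org.apache.beam.sdk.io.fs.MoveOptions;
import org.apache.beam.sdk.io.fs.ResolveOptions.StandardResolveOptions;
import org.apache.beam.sdk.io.fs.ResourceId;

PCollection<String> lines = pipeline
    .apply("Match Files", FileIO.match()
        .filepattern(options.getInputFilePattern())
        .continuously(Duration.standardSeconds(60), Watch.Growth.<String>never()))
    .apply("Read Matches", FileIO.readMatches())
    .apply("Read And Move", ParDo.of(new DoFn<FileIO.ReadableFile, String>() {
      @ProcessElement
      public void processElement(ProcessContext c) throws Exception {
        FileIO.ReadableFile file = c.element();
        // Read the whole file before moving it, so the move cannot race the read.
        for (String line : file.readFullyAsUTF8String().split("\r?\n")) {
          c.output(line);
        }
        // Rename the file into a processed/ prefix so a restarted job
        // will no longer match it. "gs://my-bucket/processed/" is a placeholder.
        ResourceId source = file.getMetadata().resourceId();
        ResourceId destDir = FileSystems.matchNewResource("gs://my-bucket/processed/", true);
        ResourceId dest = destDir.resolve(source.getFilename(), StandardResolveOptions.RESOLVE_FILE);
        FileSystems.rename(
            Collections.singletonList(source),
            Collections.singletonList(dest),
            MoveOptions.StandardMoveOptions.IGNORE_MISSING_FILES);
      }
    }));

One thing I'm not sure about: if Beam retries a bundle after the rename has already happened, re-reading the moved file would fail, so maybe the move belongs in a separate step after the Pub/Sub write. Any advice is appreciated.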