
I created a streaming Apache Beam pipeline that reads files from GCS folders and inserts them into BigQuery. It works perfectly, but it re-processes all the files when I stop and rerun the job, so all the data gets replicated again.

So my idea is to move files from the scanned directory to another one, but I don't know how to do that technically with Apache Beam.

Thank you


public static PipelineResult run(Options options) {
        // Create the pipeline.
        Pipeline pipeline = Pipeline.create(options);

        /*
         * Steps:
         *  1) Read from the text source.
         *  2) Write each text record to Pub/Sub
         */

        LOG.info("Running pipeline");
        LOG.info("Input : " + options.getInputFilePattern());
        LOG.info("Output : " + options.getOutputTopic());

        PCollection<String> collection = pipeline
                .apply("Read Text Data", TextIO.read()
                        .from(options.getInputFilePattern())
                        .watchForNewFiles(Duration.standardSeconds(60), Watch.Growth.<String>never()))

                .apply("Write logs", ParDo.of(new DoFn<String, String>() {
                    @ProcessElement
                    public void processElement(ProcessContext c) throws Exception {
                        LOG.info(c.element());
                        c.output(c.element());
                    }
                }));

        collection.apply("Write to PubSub", PubsubIO.writeStrings().to(options.getOutputTopic()));

        return pipeline.run();
    }

Majdi
  • Is your directory receiving new files constantly? Are you looking to keep this pipeline live as it runs? Or do you want to run it once every week / day / month / etc.? – Pablo Sep 24 '19 at 20:26
  • @Pablo Yes, I want to keep this pipeline live to process streaming data, so that a file deposited now is processed right away. My code works fine, but when I relaunch the job it re-processes all the data. I found a possible solution, creating a dynamic path, but it doesn't work: Apache Beam seems to evaluate the code just once when the job is launched and always keeps the first generated path. – Majdi Sep 26 '19 at 10:01
  • `Pipeline pipeline = Pipeline.create(options); String path = "gs://dev_data/" + date.format(date).split("-")[0] + "/" + date.format(date).split("-")[1] + "/" + date.format(date).split("-")[2] + "/*.gz"; PCollection<String> collection = pipeline.apply("Read Text Data", TextIO.read().from(path).watchForNewFiles(Duration.standardSeconds(60), Watch.Growth.never())); return pipeline.run();` @Pablo – Majdi Sep 26 '19 at 10:07

1 Answer


A couple of tips:

  • You are normally not expected to stop and rerun a streaming pipeline. Streaming pipelines are meant to run indefinitely, and to be updated occasionally when you want to change their logic.
  • Nonetheless, it is possible to use FileIO to match a number of files and move them after they have been processed.

You would write a DoFn class, say ReadWholeFileThenMoveToAnotherBucketDoFn, that reads the whole file and then moves it to a new bucket; a sketch of such a DoFn follows the pipeline snippet below.

Pipeline pipeline = Pipeline.create(options);


PCollection<MatchResult.Metadata> matches = pipeline
        .apply("Read Text Data", FileIO.match()
                .filepattern(options.getInputFilePattern())
                .continuously(Duration.standardSeconds(60), 
                                Watch.Growth.<String>never()));

matches.apply(FileIO.readMatches())
        .apply(ParDo.of(new ReadWholeFileThenMoveToAnotherBucketDoFn()))
        .apply("Write logs", ParDo.of(new DoFn<String, String>() {
            @ProcessElement
            public void processElement(ProcessContext c) throws Exception {
                LOG.info(c.element());
                c.output(c.element());
            }
        }));

....
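
For completeness, here is a minimal sketch of what that DoFn could look like, using Beam's FileSystems API to move each file after its contents have been emitted. The destination bucket name is hypothetical and error handling is omitted; note that moving the file inside processElement means a retried bundle may find the file already gone, so treat this as a starting point rather than a production pattern.

import java.io.IOException;
import java.util.Collections;

import org.apache.beam.sdk.io.FileIO;
import org.apache.beam.sdk.io.FileSystems;
import org.apache.beam.sdk.io.fs.ResourceId;
import org.apache.beam.sdk.transforms.DoFn;

// Sketch only: reads each matched file fully, emits its contents as one
// String, then moves the file out of the watched location so a restarted
// pipeline will not pick it up again. Declare it as a static nested class
// of your pipeline class.
static class ReadWholeFileThenMoveToAnotherBucketDoFn
        extends DoFn<FileIO.ReadableFile, String> {

    @ProcessElement
    public void processElement(ProcessContext c) throws IOException {
        FileIO.ReadableFile file = c.element();

        // Emit the whole file as a single element.
        c.output(file.readFullyAsUTF8String());

        // Move the file. "gs://my-processed-bucket/" is a hypothetical
        // destination; replace it with your own archive location.
        ResourceId source = file.getMetadata().resourceId();
        ResourceId destination = FileSystems.matchNewResource(
                "gs://my-processed-bucket/" + source.getFilename(),
                false /* isDirectory */);
        FileSystems.rename(
                Collections.singletonList(source),
                Collections.singletonList(destination));
    }
}

Since readFullyAsUTF8String() loads the entire file into memory, this suits small files; for large ones you would read incrementally instead.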
Pablo