
I created a streaming Apache Beam pipeline that reads files from GCS folders and inserts them into BigQuery. It works perfectly, but it re-processes all the files when I stop and rerun the job, so all the data gets replicated again.

So my idea is to move files from the scanned directory to another one, but I don't know how to do that technically with Apache Beam.

Thank you


public static PipelineResult run(Options options) {
        // Create the pipeline.
        Pipeline pipeline = Pipeline.create(options);

        /*
         * Steps:
         *  1) Read from the text source.
         *  2) Write each text record to Pub/Sub
         */

        LOG.info("Running pipeline");
        LOG.info("Input : " + options.getInputFilePattern());
        LOG.info("Output : " + options.getOutputTopic());

        PCollection<String> collection = pipeline
                .apply("Read Text Data", TextIO.read()
                        .from(options.getInputFilePattern())
                        .watchForNewFiles(Duration.standardSeconds(60), Watch.Growth.<String>never()))

                .apply("Write logs", ParDo.of(new DoFn<String, String>() {
                    @ProcessElement
                    public void processElement(ProcessContext c) throws Exception {
                        LOG.info(c.element());
                        c.output(c.element());
                    }
                }));

        collection.apply("Write to PubSub", PubsubIO.writeStrings().to(options.getOutputTopic()));

        return pipeline.run();
    }

Majdi
  • Is your directory receiving new files constantly? Are you looking to keep this pipeline live as it runs? Or do you want to run it once every week / day / month / etc.? – Pablo Sep 24 '19 at 20:26
  • @Pablo Yes, I want to keep this pipeline live to process streaming data, so that a file deposited now is processed right away. My code works fine, but when I relaunch the job it re-processes all the data. I found a possible solution, creating a dynamic path, but it doesn't work: Apache Beam seems to evaluate the code just once when the job is launched and always keeps the first generated path. – Majdi Sep 26 '19 at 10:01
  • `Pipeline pipeline = Pipeline.create(options); String path = "gs://dev_data/" + date.format(date).split("-")[0] + "/" + date.format(date).split("-")[1] + "/" + date.format(date).split("-")[2] + "/*.gz"; PCollection<String> collection = pipeline.apply("Read Text Data", TextIO.read().from(path).watchForNewFiles(Duration.standardSeconds(60), Watch.Growth.never())); return pipeline.run();` @Pablo – Majdi Sep 26 '19 at 10:07

1 Answer


A couple of tips:

  • You are normally not expected to stop and rerun a streaming pipeline. Streaming pipelines are meant to run indefinitely, and to be updated occasionally when you want to change their logic.
  • Nonetheless, it is possible to use FileIO to match a number of files and move them after they have been processed.

You would write a DoFn class, say ReadWholeFileThenMoveToAnotherBucketDoFn, that reads the whole file and then moves it to a new bucket; a sketch of such a DoFn follows the pipeline snippet below.

Pipeline pipeline = Pipeline.create(options);


PCollection<MatchResult.Metadata> matches = pipeline
        .apply("Read Text Data", FileIO.match()
                .filepattern(options.getInputFilePattern())
                .continuously(Duration.standardSeconds(60), 
                                Watch.Growth.<String>never()));

matches.apply(FileIO.readMatches())
        .apply(ParDo.of(new ReadWholeFileThenMoveToAnotherBucketDoFn()))
        .apply("Write logs", ParDo.of(new DoFn<String, String>() {
            @ProcessElement
            public void processElement(ProcessContext c) throws Exception {
                LOG.info(c.element());
                c.output(c.element());
            }
        }));

....
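
For completeness, here is a minimal sketch of what that DoFn could look like, using Beam's FileSystems API to move each file after its contents have been emitted. The destination bucket name is hypothetical and error handling is omitted; note that moving the file inside processElement means a retried bundle may find the file already gone, so treat this as a starting point rather than a production pattern.

import java.io.IOException;
import java.util.Collections;

import org.apache.beam.sdk.io.FileIO;
import org.apache.beam.sdk.io.FileSystems;
import org.apache.beam.sdk.io.fs.ResourceId;
import org.apache.beam.sdk.transforms.DoFn;

// Sketch only: reads each matched file fully, emits its contents as one
// String, then moves the file out of the watched location so a restarted
// pipeline will not pick it up again. Declare it as a static nested class
// of your pipeline class.
static class ReadWholeFileThenMoveToAnotherBucketDoFn
        extends DoFn<FileIO.ReadableFile, String> {

    @ProcessElement
    public void processElement(ProcessContext c) throws IOException {
        FileIO.ReadableFile file = c.element();

        // Emit the whole file as a single element.
        c.output(file.readFullyAsUTF8String());

        // Move the file. "gs://my-processed-bucket/" is a hypothetical
        // destination; replace it with your own archive location.
        ResourceId source = file.getMetadata().resourceId();
        ResourceId destination = FileSystems.matchNewResource(
                "gs://my-processed-bucket/" + source.getFilename(),
                false /* isDirectory */);
        FileSystems.rename(
                Collections.singletonList(source),
                Collections.singletonList(destination));
    }
}

Since readFullyAsUTF8String() loads the entire file into memory, this suits small files; for large ones you would read incrementally instead.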
Pablo