
I have a pipeline like this:

Continuously read CSV files -> modify them -> save the modified files

The file pattern used for the continuous reading comes from my custom pipeline options.

What I want is to be able to change that pattern by updating the pipeline. But after the update the argument shows the new value, while the whole pipeline keeps working with the old one:

[Screenshot: the updated job's options show the modified pattern]

[Screenshot: the pipeline is still reading files matched by the old pattern]

Code fragments:

  static class BigQueryDataPreparatorFn extends DoFn<KV<String, String>, KV<String, String>> {
    @ProcessElement
    public void processElement(final ProcessContext context) {
      final KV<String, String> element = context.element();
      String beeswaxWinData = element.getValue();
      // Re-escape embedded quotes so the BigQuery CSV import accepts them.
      beeswaxWinData = beeswaxWinData.replace("\\\"", "\"\"");

      final BeeswaxDataflowOptions options =
          context.getPipelineOptions().as(BeeswaxDataflowOptions.class);
      final String key = element.getKey() + "##" + options.getSourcePath();
      context.output(KV.of(key, beeswaxWinData));
    }
  }
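
The BeeswaxDataflowOptions interface isn't shown above; it is roughly a set of getters/setters like the following sketch (reconstructed from the calls in the code, so names and types are approximate). getTempLocation() comes from the base PipelineOptions.

  public interface BeeswaxDataflowOptions extends DataflowPipelineOptions {
    // Base GCS path prepended to the file pattern.
    String getSourcePath();
    void setSourcePath(String value);

    // Glob appended to sourcePath for FileIO.match().
    String getSourceFilesPattern();
    void setSourceFilesPattern(String value);

    // Polling interval, in seconds, for FileIO.match().continuously().
    int getInterval();
    void setInterval(int value);

    // Fixed-window size, in seconds.
    int getWindowInterval();
    void setWindowInterval(int value);

    // Number of output shards for FileIO.writeDynamic().
    int getShards();
    void setShards(int value);

    // Output directory for the rewritten files.
    String getOutputPath();
    void setOutputPath(String value);
  }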

  static void run(final BeeswaxDataflowOptions options) {
    final Pipeline pipeline = Pipeline.create(options);
    final PCollection<MatchResult.Metadata> matches =
        pipeline.apply(
            "Read",
            FileIO.match()
                .filepattern(options.getSourcePath() + options.getSourceFilesPattern())
                .continuously(
                    Duration.standardSeconds(options.getInterval()),
                    Watch.Growth.<String>never()));

    matches
        .apply(FileIO.readMatches().withCompression(GZIP))
        .apply(
            Window.<FileIO.ReadableFile>into(
                    FixedWindows.of(Duration.standardSeconds(options.getWindowInterval())))
                .accumulatingFiredPanes()
                .withAllowedLateness(Duration.ZERO)
                .triggering(
                    Repeatedly.forever(AfterPane.elementCountAtLeast(1).getContinuationTrigger())))
        .apply(
            "Uncompress",
            MapElements.into(
                    TypeDescriptors.kvs(TypeDescriptors.strings(), TypeDescriptors.strings()))
                .via(
                    file -> {
                      final String filePath = file.getMetadata().resourceId().toString();
                      try {
                        return KV.of(filePath, file.readFullyAsUTF8String());
                      } catch (final IOException e) {
                        return KV.of(filePath, "");
                      }
                    }))
        .apply("Prepare for BigQuery import", ParDo.of(new BigQueryDataPreparatorFn()))
        .apply(
            "Save results",
            FileIO.<String, KV<String, String>>writeDynamic()
                .withCompression(GZIP)
                .by(KV::getKey)
                .withDestinationCoder(StringUtf8Coder.of())
                .via(Contextful.fn(KV::getValue), TextIO.sink())
                .withNumShards(options.getShards())
                .to(options.getOutputPath())
                .withTempDirectory(options.getTempLocation())
                .withNaming(AbsoluteNaming::new));

    pipeline.run().waitUntilFinish();
  }
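
In case it matters: the options are consumed in two different ways here. The string passed to filepattern(...) is computed once, on the machine that builds the pipeline graph, while context.getPipelineOptions() inside BigQueryDataPreparatorFn is resolved on the workers at run time. A minimal sketch of that difference (illustrateOptionCapture is a hypothetical helper for illustration, not part of the job):

  static void illustrateOptionCapture(final BeeswaxDataflowOptions options) {
    // Graph-construction time: this concatenation happens right now, and the
    // resulting string is serialized into the job graph as a constant.
    final String capturedAtConstruction =
        options.getSourcePath() + options.getSourceFilesPattern();

    // Run time: a DoFn like this re-reads the options on the worker, the same
    // way BigQueryDataPreparatorFn does above.
    final DoFn<String, String> readAtRuntime =
        new DoFn<String, String>() {
          @ProcessElement
          public void processElement(final ProcessContext context) {
            final BeeswaxDataflowOptions runtimeOptions =
                context.getPipelineOptions().as(BeeswaxDataflowOptions.class);
            context.output(runtimeOptions.getSourceFilesPattern());
          }
        };
  }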

How I run the pipeline:

 ./gradlew clean run -Pargs="--update --runner=DataflowRunner --jobName=beeswax-wins-fixer --appName=beeswax-wins-fixer --workerRegion=europe-west1 --project=ozone-analytics-dev --gcpTempLocation=gs://ozone-dataflows/beeswax-wins-fixer/temp --tempLocation=gs://ozone-dataflows/beeswax-wins-fixer/temp --stagingLocation=gs://ozone-dataflows/beeswax-wins-fixer/staging --shards=5 --outputPath=gs://ozone-beeswax-new/logs/ --sourcePath=gs://ozone-beeswax/logs/ --sourceFilesPattern=wins/YYYY=*/MM=*/dd=*/HH=*/mm=*/*.gz --streaming=true --interval=120 --windowInterval=30 --autoscalingAlgorithm=THROUGHPUT_BASED --maxNumWorkers=5 --numWorkers=2 --region=europe-west2"
  • Is this a streaming pipeline? Also, when you say update, did you use the update feature of the pipeline? Updating a Dataflow pipeline will result in a new job. Can you make sure that you are looking at the new job? – Ankur Apr 21 '20 at 00:15
  • Yes, this is a streaming pipeline. Updating means updating just one parameter of the pipeline, the one used in continuous file reading. Yes, I'm sure this is the updated job, because of the changed property (sourceFilesPattern is changed, but the pipeline uses the old one). – Kapitalny Apr 21 '20 at 06:08
  • How are you updating it and what does this arg do? Where do you change/update this arg? Can you provide more of your code, so I can investigate further? – Alexandre Moraes Apr 21 '20 at 07:13
  • OK, I've added the whole pipeline code. I just run the pipeline with the `--update` option and a different `sourceFilesPattern` parameter. After being updated, the pipeline is still looking at the old pattern, as shown in the screenshots. I'm not looking at the old pipeline; I'm 100% sure this is the updated one. – Kapitalny Apr 21 '20 at 08:50
  • The pattern you provided is the old one (where you wrote how you run your code), right? Have you confirmed it is the updated job by checking the job ID? It should be different even though the job name is the same, [here](https://cloud.google.com/dataflow/docs/guides/updating-a-pipeline#UpdateMechanics). – Alexandre Moraes Apr 22 '20 at 11:35

0 Answers