
This program ingests records from a file, parses them, saves the parsed records to the database, and writes failure records to a Cloud Storage bucket. The test file I'm using produces only 3 failure records, and when run locally the final step `parseResults.get(failedRecords).apply("WriteFailedRecordsToGCS", TextIO.write().to(failureRecordsPath));` executes in milliseconds.

In Dataflow I am running the job with 5 workers. The job hangs indefinitely on the write step even after writing the 3 failure records successfully. I can see that it is stuck in the step `WriteFailedRecordsToGCS/WriteFiles/FinalizeTempFileBundles/Reshuffle.ViaRandomKey/Pair with random key.out0`.

Can anyone let me know why this behaves so differently between DirectRunner and Dataflow? The whole pipeline is below.

        // Pipeline options plus output tags for the parse step's main and failure outputs.
        StageUtilizationDataSourceOptions options = PipelineOptionsFactory.fromArgs(args).as(StageUtilizationDataSourceOptions.class);
        final TupleTag<Utilization> parsedRecords = new TupleTag<Utilization>("parsedRecords") {};
        final TupleTag<String> failedRecords = new TupleTag<String>("failedRecords") {};
        DrgAnalysisDbStage drgAnalysisDbStage = new DrgAnalysisDbStage(options);
        HashMap<String, Client> clientKeyMap = drgAnalysisDbStage.getClientKeys();

        Pipeline pipeline = Pipeline.create(options);
        PCollectionTuple parseResults = PCollectionTuple.empty(pipeline);

        // Read the input file from Cloud Storage.
        PCollection<String> records = pipeline.apply("ReadFromGCS", TextIO.read().from(options.getGcsFilePath()));

        if (FileTypes.utilization.equalsIgnoreCase(options.getFileType())) {
            // Parse each line, tagging successes and failures separately.
            parseResults = records
                    .apply("ConvertToUtilizationRecord", ParDo.of(new ParseUtilizationFile(parsedRecords, failedRecords, clientKeyMap, options.getGcsFilePath()))
                            .withOutputTags(parsedRecords, TupleTagList.of(failedRecords)));
            // Persist successfully parsed records to the staging table.
            parseResults.get(parsedRecords).apply("WriteToUtilizationStagingTable", drgAnalysisDbStage.writeUtilizationRecordsToStagingTable());
        } else {
            logger.error("Unrecognized file type provided: " + options.getFileType());
        }

        // Write the failure records back to Cloud Storage. This is the step that hangs on Dataflow.
        String failureRecordsPath = Utilities.getFailureRecordsPath(options.getGcsFilePath(), options.getFileType());
        parseResults.get(failedRecords).apply("WriteFailedRecordsToGCS", TextIO.write().to(failureRecordsPath));

        pipeline.run().waitUntilFinish();
– user2008914
  • If I launch the Dataflow job with just one worker, the write step behaves the same as it does under the DirectRunner and resolves successfully in about a second. Why does increasing the number of workers stall the write step? – user2008914 Feb 04 '20 at 15:42
  • Do your firewall rules allow communication between Dataflow workers? – Guillem Xercavins Feb 04 '20 at 18:32
  • All of the other steps seem to execute fine with multiple workers, but I will check that out. – user2008914 Feb 04 '20 at 18:49
  • Yes, but steps can be fused and executed sequentially in the same worker. However, this gets stuck in the reshuffle step, where data needs to be shuffled between workers. – Guillem Xercavins Feb 04 '20 at 19:11
  • My firewall rules were not configured correctly to allow workers to communicate with one another. Following the directions at the link resolved my issue completely (a sketch of the rule is shown after these comments): https://cloud.google.com/dataflow/docs/guides/routes-firewall#firewall_rules_required_by – user2008914 Mar 06 '20 at 14:25
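For reference, the firewall rule described on that page allows traffic between Dataflow worker VMs on TCP ports 12345-12346, which is what the reshuffle step needs when work is spread across multiple workers. A minimal sketch of the gcloud command, where FIREWALL_RULE_NAME and NETWORK are placeholders you would replace with your own values:

    # Allow Dataflow worker VMs to exchange shuffle data with one another.
    # Ports 12345-12346 are the ports listed in the Dataflow routes/firewall docs;
    # workers launched by Dataflow carry the "dataflow" network tag automatically.
    gcloud compute firewall-rules create FIREWALL_RULE_NAME \
        --network=NETWORK \
        --action=allow \
        --direction=ingress \
        --target-tags=dataflow \
        --source-tags=dataflow \
        --priority=0 \
        --rules=tcp:12345-12346

With a single worker (or the DirectRunner) no data ever crosses a worker boundary, which is why the pipeline only hangs once multiple workers are involved.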

0 Answers