This program ingests records from a file, parses them and saves them to the database, and writes failure records to a Cloud Storage bucket. The test file I'm using produces only 3 failure records. When run locally, the final step, parseResults.get(failedRecords).apply("WriteFailedRecordsToGCS", TextIO.write().to(failureRecordsPath)); executes in milliseconds.
In Dataflow I am running the process with 5 workers. There, the process hangs indefinitely on the write step, even after the 3 failure records have been written successfully. I can see that it is stuck in the step WriteFailedRecordsToGCS/WriteFiles/FinalizeTempFileBundles/Reshuffle.ViaRandomKey/Pair with random key.out0
Can anyone let me know why this behaves so differently between DirectRunner and Dataflow? The whole pipeline is below.
StageUtilizationDataSourceOptions options = PipelineOptionsFactory.fromArgs(args).as(StageUtilizationDataSourceOptions.class);
final TupleTag<Utilization> parsedRecords = new TupleTag<Utilization>("parsedRecords") {};
final TupleTag<String> failedRecords = new TupleTag<String>("failedRecords") {};
DrgAnalysisDbStage drgAnalysisDbStage = new DrgAnalysisDbStage(options);
HashMap<String, Client> clientKeyMap = drgAnalysisDbStage.getClientKeys();
Pipeline pipeline = Pipeline.create(options);
PCollectionTuple parseResults = PCollectionTuple.empty(pipeline);
PCollection<String> records = pipeline.apply("ReadFromGCS", TextIO.read().from(options.getGcsFilePath()));
if (FileTypes.utilization.equalsIgnoreCase(options.getFileType())) {
    parseResults = records
        .apply("ConvertToUtilizationRecord", ParDo.of(new ParseUtilizationFile(parsedRecords, failedRecords, clientKeyMap, options.getGcsFilePath()))
            .withOutputTags(parsedRecords, TupleTagList.of(failedRecords)));
    parseResults.get(parsedRecords).apply("WriteToUtilizationStagingTable", drgAnalysisDbStage.writeUtilizationRecordsToStagingTable());
} else {
    logger.error("Unrecognized file type provided: " + options.getFileType());
}
String failureRecordsPath = Utilities.getFailureRecordsPath(options.getGcsFilePath(), options.getFileType());
parseResults.get(failedRecords).apply("WriteFailedRecordsToGCS", TextIO.write().to(failureRecordsPath));
pipeline.run().waitUntilFinish();
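
For anyone unfamiliar with the tagged outputs used above, the split between parsedRecords and failedRecords can be pictured in plain Java, without Beam. This is only a rough sketch: the body of ParseUtilizationFile is not shown in this question, so the class name RoutingSketch and the validity rule (exactly three comma-separated fields) are made-up assumptions used purely for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class RoutingSketch {
    /** Holds the two outputs, mirroring the parsedRecords and failedRecords tags. */
    static final class Result {
        final List<String[]> parsed = new ArrayList<>();
        final List<String> failed = new ArrayList<>();
    }

    // Hypothetical rule (an assumption, not the real parser):
    // a line is valid when it has exactly three comma-separated fields.
    static Result route(List<String> lines) {
        Result result = new Result();
        for (String line : lines) {
            String[] fields = line.split(",", -1);
            if (fields.length == 3) {
                result.parsed.add(fields);   // would go to the staging table
            } else {
                result.failed.add(line);     // would go to the failure file
            }
        }
        return result;
    }

    public static void main(String[] args) {
        Result r = route(Arrays.asList("a,b,c", "bad line", "d,e,f", "x,y"));
        System.out.println("parsed=" + r.parsed.size() + " failed=" + r.failed.size());
    }
}
```

In the real pipeline each list corresponds to a PCollection retrieved with parseResults.get(...), and only the failed side feeds the WriteFailedRecordsToGCS step that hangs.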