FlatMap and Map in Apache Beam

Question

Do the FlatMap and Map functions in Apache Beam for Python run in parallel?

(p
 | 'GetJava' >> beam.io.ReadFromText(input)
 | 'GetImports' >> beam.FlatMap(lambda line: startsWith(line, keyword))
 | 'PackageUse' >> beam.FlatMap(lambda line: packageUse(line, keyword))
 | 'TotalUse' >> beam.CombinePerKey(sum)
 | 'Top_5' >> beam.transforms.combiners.Top.Of(5, by_value)
 | 'write' >> beam.io.WriteToText(output_prefix)
)

score 1 · Accepted Answer · answered Apr 17 '19 at 17:03

The parallelization in your pipeline occurs after the ReadFromText transform, which splits directories into individual files and files into segments.

Each segment is processed serially by a single worker, so within a segment the output of your first FlatMap transform is passed to the second FlatMap serially. Across segments, however, many instances of the FlatMap+FlatMap pair run in parallel, each over its own file segment.
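As a rough illustration of that execution model, here is a toy sketch in plain Python. It is not Beam code and does not reflect how any particular runner schedules work: the thread pool stands in for the workers, and get_imports, package_use, process_segment, and the sample segments are all hypothetical stand-ins for your own functions and input.

```python
from concurrent.futures import ThreadPoolExecutor


def get_imports(line):
    # Hypothetical stand-in for the first FlatMap: emit zero or one element per line.
    return [line] if line.startswith("import ") else []


def package_use(line):
    # Hypothetical stand-in for the second FlatMap: emit a (package, 1) pair.
    package = line[len("import "):].rstrip(";").rsplit(".", 1)[0]
    return [(package, 1)]


def process_segment(segment):
    # Inside one segment the two steps run serially, element by element.
    out = []
    for line in segment:
        for imp in get_imports(line):
            out.extend(package_use(imp))
    return out


# Made-up segments, as if the read step had split the input into two pieces.
segments = [
    ["import java.util.List;", "class A {}"],
    ["import java.util.Map;", "import java.io.File;"],
]

# Different segments are handled concurrently, one task per segment.
with ThreadPoolExecutor(max_workers=2) as pool:
    per_segment = list(pool.map(process_segment, segments))

# Rough analogue of CombinePerKey(sum): merge the per-segment outputs by key.
totals = {}
for pairs in per_segment:
    for key, value in pairs:
        totals[key] = totals.get(key, 0) + value
```

In this sketch the two stand-in steps never run in parallel with each other for the same line, but the two segments are processed independently, which is the behaviour described above.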

Let me know if that answers your question : )
