Do the FlatMap and Map functions in Apache Beam for Python run in parallel?

(p
 | 'GetJava' >> beam.io.ReadFromText(input)
 | 'GetImports' >> beam.FlatMap(lambda line: startsWith(line, keyword))
 | 'PackageUse' >> beam.FlatMap(lambda line: packageUse(line, keyword))
 | 'TotalUse' >> beam.CombinePerKey(sum)
 | 'Top_5' >> beam.transforms.combiners.Top.Of(5, by_value)
 | 'write' >> beam.io.WriteToText(output_prefix)
)
Vadim Kotov
mileven

1 Answer

The parallelization in your pipeline comes from the ReadFromText transform. That transform splits the input (a directory or file pattern) into individual files, and the files into segments.

Each segment is processed serially by a single worker, so within a segment the output of your first FlatMap flows into the second FlatMap one element at a time. You will, however, have many instances of this FlatMap+FlatMap chain running in parallel, one over each file segment.
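
To make that execution model concrete, here is a toy sketch in plain Python. It does not use Beam and is not how a runner is actually implemented. The helpers starts_with and package_use are hypothetical stand-ins for your startsWith and packageUse, and the thread pool only plays the role of the workers. Each segment goes through both FlatMap-like steps serially, while separate segments are handled concurrently.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import chain


# Hypothetical stand-ins for the question's startsWith / packageUse helpers.
def starts_with(line, keyword):
    # FlatMap-style: return zero or one outputs for each input line.
    return [line] if line.startswith(keyword) else []


def package_use(line, keyword):
    # Emit a (package, 1) pair for a line such as "import a.b.C;".
    name = line[len(keyword):].strip().rstrip(";")
    return [(name, 1)]


def process_segment(segment, keyword="import"):
    # One worker walks its segment serially; every element passes through
    # both FlatMap-like steps back to back.
    out = []
    for line in segment:
        for imported in starts_with(line, keyword):
            out.extend(package_use(imported, keyword))
    return out


def run(segments, max_workers=4):
    # Segments are independent of each other, so they can be processed
    # concurrently (threads here stand in for pipeline workers).
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = list(pool.map(process_segment, segments))
    return list(chain.from_iterable(results))


# Two made-up "file segments" of Java source lines.
segments = [
    ["import java.util.List;", "class A {}"],
    ["import java.io.File;", "import java.util.Map;"],
]
pairs = run(segments)
```

The (package, 1) pairs in pairs are what a step like CombinePerKey(sum) would then group and add up across all segments.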

Let me know if that answers your question : )

Pablo