I am trying to split my pipeline into many smaller pipelines so that they execute faster. I am partitioning a PCollection of Google Cloud Storage blobs (PCollection<Blob>) so that I get a

    PCollectionList<Blob> collectionList

From there, I would love to be able to do something like:

    Pipeline p2 = Pipeline.create(collectionList.get(0))
        .apply(stuff)
        .apply(stuff);

    Pipeline p3 = Pipeline.create(collectionList.get(1))
        .apply(stuff)
        .apply(stuff);

But I haven't found any documentation about creating an initial PCollection from an already existing PCollection. I'd be very grateful if anyone could point me in the right direction. Thanks!

1 Answer

You should look into the Partition transform, which splits a PCollection into N smaller ones. You provide a PartitionFn that defines how each element is assigned to a partition. Below is an example from the Beam programming guide:

    import org.apache.beam.sdk.transforms.Partition;
    import org.apache.beam.sdk.transforms.Partition.PartitionFn;
    import org.apache.beam.sdk.values.PCollection;
    import org.apache.beam.sdk.values.PCollectionList;

    // Provide an int value with the desired number of result partitions, and a PartitionFn that represents the partitioning function.
    // In this example, we define the PartitionFn in-line.
    // Returns a PCollectionList containing each of the resulting partitions as individual PCollection objects.
    PCollection<Student> students = ...;
    // Split students up into 10 partitions, by percentile:
    PCollectionList<Student> studentsByPercentile =
        students.apply(Partition.of(10, new PartitionFn<Student>() {
            @Override
            public int partitionFor(Student student, int numPartitions) {
                // getPercentile() is in 0..99, so the result is in 0..numPartitions-1.
                return student.getPercentile() * numPartitions / 100;
            }
        }));

    // You can extract each partition from the PCollectionList using the get method, as follows:
    PCollection<Student> fortiethPercentile = studentsByPercentile.get(4);
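
To see why `get(4)` holds the fortieth percentile, it may help to run the partitioning arithmetic outside Beam. The following is a minimal plain-Java sketch, not Beam code: the class name `PartitionIndexSketch` and the `split` helper are made up for illustration, and the lists of buckets only stand in for what a `PCollectionList` would contain.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class PartitionIndexSketch {

    // Same arithmetic as the PartitionFn in the example: maps a percentile
    // in 0..99 to an index in 0..numPartitions-1 using integer division.
    public static int partitionFor(int percentile, int numPartitions) {
        return percentile * numPartitions / 100;
    }

    // Hypothetical stand-in for the Partition transform: routes every
    // element into exactly one of numPartitions buckets.
    public static List<List<Integer>> split(List<Integer> percentiles, int numPartitions) {
        List<List<Integer>> buckets = new ArrayList<>();
        for (int i = 0; i < numPartitions; i++) {
            buckets.add(new ArrayList<>());
        }
        for (int percentile : percentiles) {
            buckets.get(partitionFor(percentile, numPartitions)).add(percentile);
        }
        return buckets;
    }

    public static void main(String[] args) {
        List<List<Integer>> buckets = split(Arrays.asList(0, 9, 40, 49, 50, 99), 10);
        // Percentiles 40..49 all map to index 4.
        System.out.println(buckets.get(4)); // prints [40, 49]
    }
}
```

With 10 partitions, each index covers a band of ten percentiles, so index 4 collects percentiles 40 through 49. A partition function must always return a value between 0 and numPartitions - 1, which this formula does as long as the percentile stays in 0..99.
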
Guillem Xercavins