
I'm trying to use Apache Beam to parallelize N trials of a simulation. The simulation runs on a matrix V sourced from a .mat MATLAB file. My first instinct (forgive me, I'm new to Apache Beam/Dataflow) was to extend FileBasedSource, but further investigation convinced me that this is not the right approach. Most explicitly, the Apache Beam documentation says, "You should create a new source if you'd like to use the advanced features that the Source API provides," and I don't need any of them—I just want to read a variable (or a few)! I eventually stumbled upon https://stackoverflow.com/a/45946643, which is how I now intend to get V (by passing the file-like object returned by FileSystems.open to scipy.io.loadmat).
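Concretely, that approach boils down to something like the following sketch (the gs:// path and the variable name 'V' are placeholders for my actual inputs):

import scipy.io
from apache_beam.io.filesystems import FileSystems

# FileSystems.open works for local paths and gs:// URIs alike and returns
# a file-like object, which scipy.io.loadmat accepts in place of a path.
f = FileSystems.open('gs://my-bucket/simulation.mat')  # placeholder path
V = scipy.io.loadmat(f)['V']  # 'V' is the placeholder variable name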

The next question is how to create a PCollection of N permutations of V. The obvious solution is something like beam.Create([permutation(V) for _ in xrange(N)]). However, I was a little thrown off by this comment in the documentation: "The latter is primarily useful for testing and debugging purposes." Maybe a slight improvement is to perform the permutation in a DoFn.
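For concreteness, the DoFn variant I have in mind would be roughly this (assuming permutation(V) returns one freshly permuted copy of V per call):

import apache_beam as beam

class PermuteFn(beam.DoFn):

    def __init__(self, v):
        self._v = v  # V is pickled once and shipped to the workers

    def process(self, _):
        # Ignore the trial index; emit one permutation per trial.
        yield permutation(self._v)

trials = (p
          | 'trial_count' >> beam.Create(range(N))
          | 'permute_v' >> beam.ParDo(PermuteFn(V)))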

I have one last idea. It sounds a bit stupid (and it may well be stupid), but hear me out on this one (humor my sleep-deprived self). The documentation points to an implementation of CountingSource (and, along with it, ReadFromCountingSource). Is there a benefit to using ReadFromCountingSource(N) over beam.Create(range(N))? If so, is the following (start of a) Pipeline reasonable:

(p | 'trial_count' >> ReadFromCountingSource(N)
   | 'permute_v' >> beam.Map(lambda _, v: permutation(v), V)  # one permuted copy per trial
   | ...)

Beam/Dataflow experts, what would you recommend?

Thank you!!

deepyaman

2 Answers


To load matrices from a .mat file into a PCollection, create a PTransform wrapper around scipy.io.loadmat by subclassing beam.Create:

import apache_beam as beam
import scipy.io

class LoadMat(beam.Create):

    def __init__(self, file_name, mdict=None, appendmat=True, **kwargs):
        # loadmat returns a dict mapping variable names to matrices.
        mat_dict = scipy.io.loadmat(file_name, mdict, appendmat, **kwargs)
        super(LoadMat, self).__init__([mat_dict])

Call this transform as follows:

from apache_beam.io.filesystems import FileSystems

with beam.Pipeline(options=pipeline_options) as p:
    matrices = (p
                | 'LoadMat' >> LoadMat(FileSystems.open(known_args.input))
                | ...)
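One caveat: because beam.Create captures its values when the pipeline graph is built, scipy.io.loadmat here runs at construction time on the launching machine rather than on the workers. For a single, reasonably sized matrix that should be fine.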
deepyaman

Let me rephrase what I think you're asking; please correct me if I'm wrong:

You have a matrix V in a MATLAB file, which needs to be read in and then run through N trials of a simulation.

EDIT: FileBasedSource cannot be used directly. I've corrected my explanation below.

Apache Beam has built-in PTransforms for reading from many file formats, but not MATLAB files. You'll need to create your own source implementation and Read transform. There are many examples based on FileBasedSource, such as ReadFromTFRecord and ReadFromAvro.

The Beam documentation has tips on implementing new I/O transforms: https://beam.apache.org/documentation/io/authoring-overview/
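For illustration, a bare-bones skeleton might look something like this (the names _MatlabSource and ReadFromMatlab are hypothetical, and a production-quality source would need more care):

import apache_beam as beam
import scipy.io
from apache_beam.io import filebasedsource

class _MatlabSource(filebasedsource.FileBasedSource):

    def read_records(self, file_name, offset_range_tracker):
        # .mat is a binary, non-splittable format, so emit the whole
        # dict of variables as a single record.
        with self.open_file(file_name) as f:
            yield scipy.io.loadmat(f)

class ReadFromMatlab(beam.PTransform):

    def __init__(self, file_pattern):
        super(ReadFromMatlab, self).__init__()
        self._source = _MatlabSource(file_pattern, splittable=False)

    def expand(self, pvalue):
        return pvalue.pipeline | beam.io.Read(self._source)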

Adding the logic for permutations will be simpler if it's decoupled from the source. If the number of trials N is static or known at pipeline-construction time, you can use FlatMap and normal Python code to return an iterable of permutations. Your logic would then look more like:

(p | 'read' >> ReadFromMatlab(file_name)
   | 'permute_v' >> beam.FlatMap(lambda x: permutation(x, N))
   | ...)
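Here, permutation would need to return an iterable of N permuted copies of the matrix; for example, a sketch using numpy's row shuffle purely as a stand-in for your actual permutation logic:

import numpy as np

def permutation(v, n):
    # np.random.permutation returns a new row-shuffled copy each call;
    # FlatMap then emits each of the n copies as a separate element.
    return [np.random.permutation(v) for _ in range(n)]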
Scott Wegner
  • Sorry, but I think you've misunderstood. My conclusion was that `FileBasedSource` is **not** the correct answer to my problem. For one, you can't just do `FileBasedSource(file_name)`, right? You'd need to subclass `FileBasedSource`, since `read_records` (and maybe other stuff) is not defined for `FileBasedSource`. Furthermore, `FileBasedSource` seems intended for cases where you get a `PCollection` of text lines/JSON objects/something from the file, especially if that read could be parallelized, and none of that is true in my case (especially with `.mat` being a binary format). On permutation: – deepyaman Feb 02 '18 at 19:51
  • The basic challenge that I'm facing is that the Apache Beam/Dataflow model seems to be well-suited to taking a `PCollection` of `M` objects and transforming it into another `PCollection` of `M` objects, or perhaps grouping by key and reducing the number of output `PCollection` objects, not so much what I want to do: take a `PCollection` of `M` (in my case, 1) objects and turn it into a `PCollection` of `M * N` objects. I guess your logic is fine for that, assuming the `permutation` call returns a list of `N` objects, like my implementation `[permutation(V) for _ in xrange(N)]` in the original question. – deepyaman Feb 02 '18 at 19:57
  • @user1093967 Thanks for pointing out my original answer was incorrect with regards to using `FileBasedSource`. I've updated my answer to reflect that you'll need to subclass and create a read transform. For permutations, `FlatMap` can output many elements for a single input. – Scott Wegner Feb 05 '18 at 18:25
  • As I wrote in my question and initial comment on your answer, I believe that creating a new source using the Source API is not the answer here. Prior to asking my question, I'd studied the code for `AvroSource` and other examples, and I revisited them over the past few days. I still don't see how it applies in this case. I'll add an answer with the solution I eventually came to yesterday, and you're welcome to see if that seems more reasonable given my use case. – deepyaman Feb 06 '18 at 20:07