I'm trying to use Apache Beam to parallelize N trials of a simulation. The simulation runs on a matrix V sourced from a .mat MATLAB file. My first instinct (forgive me, I'm new to Apache Beam/Dataflow) was to extend FileBasedSource, but further investigation convinced me that this is not the right approach. Most explicitly, the Apache Beam documentation says, "You should create a new source if you'd like to use the advanced features that the Source API provides," and I don't need any of them; I just want to read a variable (or a few). I eventually stumbled upon https://stackoverflow.com/a/45946643, which is how I now intend to get V: pass the file-like object returned by FileSystems.open to scipy.io.loadmat.
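Concretely, I'm picturing something like this for the read (an untested sketch; the gs:// path and the variable name 'V' inside the .mat file are placeholders for my actual ones):

import scipy.io
from apache_beam.io.filesystems import FileSystems

def load_matrix(mat_path, var_name='V'):
    # Open the file through Beam's FileSystems so the same code works for
    # local paths and gs:// paths, then hand the file-like object to scipy.
    with FileSystems.open(mat_path) as f:
        mat = scipy.io.loadmat(f)
    return mat[var_name]

# This runs at pipeline-construction time, on the machine building the pipeline.
V = load_matrix('gs://my-bucket/simulation_input.mat')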
The next question is how to create a PCollection of N permutations of V. The obvious solution is something like beam.Create([permutation(V) for _ in range(N)]). However, I was a little thrown off by this comment in the documentation: "The latter is primarily useful for testing and debugging purposes." Maybe a slight improvement is to perform the permutation in a DoFn instead, so that only the N trial indices are materialized up front (sketched below).
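In code, the DoFn variant would look roughly like this (untested; numpy.random.permutation stands in for my actual permutation(), and V and N are given placeholder values here just so the snippet is self-contained):

import numpy as np
import apache_beam as beam

N = 1000                          # number of trials (placeholder)
V = np.arange(12).reshape(3, 4)   # placeholder for the matrix loaded from the .mat file

class PermuteV(beam.DoFn):
    """Yields one random row-permutation of V for each trial index it receives."""
    def __init__(self, v):
        self._v = v  # V is pickled into the DoFn and shipped to the workers

    def process(self, trial_index):
        yield np.random.permutation(self._v)

with beam.Pipeline() as p:
    permuted = (p
                | 'trial_indices' >> beam.Create(range(N))
                | 'permute_v' >> beam.ParDo(PermuteV(V)))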
I have one last idea. It sounds a bit stupid (and it may well be stupid), but hear me out on this one (humor my sleep-deprived self). The documentation points to an implementation of CountingSource (and, along with it, ReadFromCountingSource). Is there a benefit to using ReadFromCountingSource(N) over beam.Create(range(N))? If so, is the following (start of a) Pipeline reasonable:
(p | 'trial_count' >> ReadFromCountingSource(N)
   | 'permute_v' >> beam.Map(lambda _, v: permutation(v), V)
   | ...)
Beam/Dataflow experts, what would you recommend?
Thank you!!