It's unclear whether it's safe to download files within a DoFn.

My DoFn will download a ~20MB file (an ML model) to apply to elements in my pipeline. According to the Beam docs, DoFn requirements include serializability and thread-compatibility.

An example (1, 2) is very similar to my DoFn. It demonstrates downloading from a GCP storage bucket (as I'm doing with DataflowRunner), but I'm not sure this approach is safe.

Should objects be downloaded to an in-memory bytes buffer instead of to disk, or is there another best practice for this use case? I haven't come across a recommended approach to this pattern yet.
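
To make this concrete, here is a rough sketch of the kind of DoFn I have in mind (the bucket path, pickle format, and predict call are placeholders rather than my actual code):

import pickle

import apache_beam as beam


class PredictDoFn(beam.DoFn):
    def setup(self):
        # Runs once per DoFn instance, so the ~20MB download is not
        # repeated for every element.
        from apache_beam.io.gcp import gcsio
        # Read the model into an in-memory buffer instead of to disk.
        with gcsio.GcsIO().open('gs://somebucket/model.pkl') as f:  # placeholder path
            self._model = pickle.loads(f.read())

    def process(self, element):
        yield self._model.predict([element])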

Brian Bien
  • Downloading a file in a DoFn makes it a "DoFn with side effects". I am not aware of a best practice for your specific question, but there is general guidance on DoFns with side effects: remember that Dataflow expects a DoFn to be idempotent, because it aims for exactly-once processing and so may retry a DoFn. Just keep that in mind when you use a DoFn with side effects. – Rui Wang Oct 29 '19 at 20:14 (see the sketch after this comment thread)
  • (from search): the code in a DoFn needs to be written such that these duplicate (sequential or concurrent) executions do not cause problems. If the outputs of a DoFn are a pure function of its inputs, then this requirement is satisfied. However, if a DoFn's execution has external side-effects, such as performing updates to external HTTP services, then the DoFn's code needs to take care to ensure that those updates are idempotent and that concurrent updates are acceptable. This property can be difficult to achieve, so it is advisable to strive to keep DoFns as pure functions as much as possible – Brian Bien Oct 29 '19 at 21:23
  • Not sure if this question is for batch or streaming, but if your model is static, you can pass it as a side input to your DoFn. – Sach Oct 29 '19 at 22:57
  • Similar answer: https://stackoverflow.com/questions/47306715/how-to-read-blob-pickle-files-from-gcs-in-a-google-cloud-dataflow-job/47315281#4731528 – Sach Oct 29 '19 at 23:04
  • @Sach static batch model. I had suspected a side input but haven't come across an example of loading a model from one yet... this is my first Beam pipeline so please do share if you've got such an example – Brian Bien Oct 30 '19 at 15:48
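
Picking up the idempotency point from Rui Wang's comment above: one way to keep the download side effect safe under retries is to write to a deterministic local path and make the final write atomic, so duplicate executions are harmless. A minimal sketch, assuming a placeholder gs:// path:

import os
import tempfile

import apache_beam as beam


class DownloadModelDoFn(beam.DoFn):
    MODEL_URI = 'gs://somebucket/model'  # placeholder path

    def process(self, element):
        from apache_beam.io.gcp import gcsio
        local_path = os.path.join(tempfile.gettempdir(), 'model')
        if not os.path.exists(local_path):
            # Download to a unique temp file, then atomically rename it into
            # place: a retried or concurrent execution either sees the final
            # file and skips the download, or harmlessly rewrites the same bytes.
            fd, tmp_path = tempfile.mkstemp(dir=tempfile.gettempdir())
            with os.fdopen(fd, 'wb') as dst:
                with gcsio.GcsIO().open(self.MODEL_URI) as src:
                    dst.write(src.read())
            os.replace(tmp_path, local_path)
        yield (element, local_path)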

2 Answers


Adding on to this answer.

If your model data is static, then you can use the code example below to pass your model as a side input.

import logging

import apache_beam as beam
import dill


# DoFn to open the model file from its GCS location
class get_model(beam.DoFn):
    def process(self, element):
        from apache_beam.io.gcp import gcsio
        logging.info('reading model from GCS')
        gcs = gcsio.GcsIO()
        # Yield a readable file handle to the object stored in GCS
        yield gcs.open(element)


# Pipeline branch to load the pickled model from the GCS bucket
# (p is an existing beam.Pipeline)
model_step = (p
              | 'start' >> beam.Create(['gs://somebucket/model'])
              | 'load_model' >> beam.ParDo(get_model())
              | 'unpickle_model' >> beam.Map(lambda f: dill.load(f)))

# DoFn to predict results, receiving the model as a side input
class predict(beam.DoFn):
    def process(self, element, model):
        (features, clients) = element
        # Probability of the positive class for each row of features
        result = model.predict_proba(features)[:, 1]
        return [(clients, result)]

# Main pipeline: read the input, predict, and write the results.
_ = (p
     | 'get_input' >> ...  # read input based on the source and preprocess it
     | 'predict_sk_model' >> beam.ParDo(predict(), beam.pvalue.AsSingleton(model_step))
     | 'write' >> ...)  # write output based on the target

For a streaming pipeline, if you want to reload the model after a predefined interval, you can check the "Slowly-changing lookup cache" pattern here.
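
For illustration, a hedged sketch of that pattern, assuming a newer Beam release that ships PeriodicImpulse; the interval, path, and events source are placeholders, and get_model and predict are the DoFns defined above:

from apache_beam.transforms import window
from apache_beam.transforms.periodicsequence import PeriodicImpulse

INTERVAL = 3600  # reload the model every hour (placeholder)

# Re-read and re-unpickle the model on a timer; with apply_windowing=True
# each impulse lands in its own fixed window of size INTERVAL.
model_side = (p
              | 'refresh' >> PeriodicImpulse(fire_interval=INTERVAL,
                                             apply_windowing=True)
              | 'to_path' >> beam.Map(lambda _: 'gs://somebucket/model')
              | 'load_model' >> beam.ParDo(get_model())
              | 'unpickle_model' >> beam.Map(lambda f: dill.load(f)))

# Window the streaming main input to match, so AsSingleton picks up the
# model loaded for the current window.
_ = (events  # e.g. a PCollection from ReadFromPubSub
     | 'window' >> beam.WindowInto(window.FixedWindows(INTERVAL))
     | 'predict' >> beam.ParDo(predict(), beam.pvalue.AsSingleton(model_side)))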

Sach
  • Slowly-changing lookup cache is also available under Beam patterns: https://beam.apache.org/documentation/patterns/overview/ – Reza Rokni Oct 31 '19 at 17:03

If it is a scikit-learn model, then you can look at hosting it in Cloud ML Engine and exposing it as a REST endpoint. You can then use something like BagState to batch, and so optimize, invocations of the model over the network. More details can be found at https://beam.apache.org/blog/2017/08/28/timely-processing.html
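
For illustration, here is a hedged sketch of that batching idea using Beam's Python state and timer API; call_model_endpoint is a hypothetical REST helper, and stateful DoFns require keyed input:

import time

import apache_beam as beam
from apache_beam import coders
from apache_beam.transforms.timeutil import TimeDomain
from apache_beam.transforms.userstate import BagStateSpec, TimerSpec, on_timer


class BatchedPredictDoFn(beam.DoFn):
    BUFFER = BagStateSpec('buffer', coders.PickleCoder())
    FLUSH = TimerSpec('flush', TimeDomain.REAL_TIME)

    def process(self, element,
                buffer=beam.DoFn.StateParam(BUFFER),
                flush=beam.DoFn.TimerParam(FLUSH)):
        _, features = element  # element is a (key, features) pair
        buffer.add(features)
        # Flush at most ~10 seconds after an element arrives.
        flush.set(time.time() + 10)

    @on_timer(FLUSH)
    def flush_batch(self, buffer=beam.DoFn.StateParam(BUFFER)):
        batch = list(buffer.read())
        buffer.clear()
        # One network round trip for the whole batch instead of one per element.
        for prediction in call_model_endpoint(batch):  # hypothetical REST call
            yield prediction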

Jayadeep Jayaraman
  • In this particular case, it's not an sklearn model (it's Keras, but conforms to the sklearn Transformer interface) – Brian Bien Oct 29 '19 at 19:20
  • Cloud ML Engine can be used to host a variety of models developed using different ML frameworks. It can host models developed using Keras + TensorFlow – Jayadeep Jayaraman Oct 30 '19 at 03:40
  • Thanks for the suggestion. I'm not sure it's standardized enough for Cloud ML Engine: it's a custom sklearn transformer that loads a pre-trained Keras model from a GCP bucket in its init – Brian Bien Oct 30 '19 at 15:42