
I've developed an Apache Beam pipeline locally where I run predictions on a sample file.

Locally on my computer I can load the model like this:

with open('gs://newbucket322/my_dumped_classifier.pkl', 'rb') as fid:
    gnb_loaded = cPickle.load(fid)

but when running on Google Dataflow that obviously doesn't work. I tried changing the path to gs://, but that also does not work.

I also tried this code snippet (from here) that was used to load files:

class ReadGcsBlobs(beam.DoFn):
    def process(self, element, *args, **kwargs):
        from apache_beam.io.gcp import gcsio
        gcs = gcsio.GcsIO()
        yield (element, gcs.open(element).read())

model = (p
         | "Initialize" >> beam.Create(["gs://bucket/file.pkl"])
         | "Read blobs" >> beam.ParDo(ReadGcsBlobs())
        )

but that doesn't work for loading my model, or at least I cannot use this model variable to call the predict method, since model here is a PCollection of the file's bytes rather than the unpickled classifier itself.

This should be a pretty straightforward task, but I can't seem to find a straightforward answer to it.

Anton

1 Answer


You can define a ParDo as below:

import logging
import pickle

import apache_beam as beam
from google.cloud import storage

class PredictSklearn(beam.DoFn):
    """Download a pickled scikit-learn model from GCS and run predictions."""

    def __init__(self, project=None, bucket_name=None, model_path=None, destination_name=None):
        self._model = None
        self._project = project
        self._bucket_name = bucket_name
        self._model_path = model_path
        self._destination_name = destination_name

    def download_blob(self, bucket_name=None, source_blob_name=None, project=None, destination_file_name=None):
        """Downloads a blob from the bucket to a local file."""
        storage_client = storage.Client(project=project)
        bucket = storage_client.get_bucket(bucket_name)
        blob = bucket.blob(source_blob_name)
        blob.download_to_filename(destination_file_name)

    # setup() runs once per DoFn instance, so the model is loaded
    # once (or very few times) rather than once per element
    def setup(self):
        logging.info("Model initialization: {}".format(self._model_path))
        self.download_blob(bucket_name=self._bucket_name, source_blob_name=self._model_path,
                           project=self._project, destination_file_name=self._destination_name)
        # unpickle the model
        with open(self._destination_name, 'rb') as fid:
            self._model = pickle.load(fid)

    def process(self, element):
        element["prediction"] = self._model.predict(element["data"])
        return [element]

Then you can invoke this ParDo in your pipeline as below:

    model = (p
         | "Read Files" >> TextIO...
         | "Run Predictions" >> beam.ParDo(PredictSklearn(project=known_args.bucket_project_id,
                                                          bucket_name=known_args.bucket_name,
                                                          model_path=known_args.model_path,
                                                          destination_name=known_args.destination_name))
        )
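
If you would rather not stage the model on the worker's local disk at all, the same idea works by streaming the pickle straight from GCS with Beam's FileSystems API in setup(). This is an untested sketch along the same lines; the class name PredictSklearnFromGcs and the single model_uri argument are just illustrative:

import logging
import pickle

import apache_beam as beam
from apache_beam.io.filesystems import FileSystems

class PredictSklearnFromGcs(beam.DoFn):
    """Loads a pickled scikit-learn model directly from GCS and runs predictions."""

    def __init__(self, model_uri=None):
        # model_uri is assumed to be a full path, e.g. "gs://<bucket>/<model>.pkl"
        self._model_uri = model_uri
        self._model = None

    def setup(self):
        logging.info("Loading model from {}".format(self._model_uri))
        # FileSystems.open returns a readable file-like object for gs:// paths
        with FileSystems.open(self._model_uri) as fid:
            self._model = pickle.load(fid)

    def process(self, element):
        element["prediction"] = self._model.predict(element["data"])
        yield element

This avoids the extra download step and any dependency on local disk space on the workers.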

Jayadeep Jayaraman
  • Hi, I can't get this code to work. First of all, download_blob isn't callable without self.download_blob, and secondly the download_blob function doesn't take that many arguments. I don't really understand the code; where does destination_name get defined? – Anton Nov 14 '19 at 09:38
  • I have updated the answer. I have not executed the specific code to check for syntax but this should give you an idea of how to run models which are pickled and stored in GCS. If you find this useful please do accept the answer. – Jayadeep Jayaraman Nov 14 '19 at 09:48
  • @JayadeepJayaraman I have very similar code to this, but for some reason the worker machine seems to be running out of space and the job gets stuck. I gave the VM 1000 GB and the model is only about 50 MB or so. Any idea why? – Kazuki Nov 23 '20 at 20:32