
I've developed an Apache Beam pipeline locally where I run predictions on a sample file.

Locally on my computer I can load the model like this:

with open('gs://newbucket322/my_dumped_classifier.pkl', 'rb') as fid:
    gnb_loaded = cPickle.load(fid)

but when running on Google Dataflow that obviously doesn't work. I tried changing the path to gs://, but that also does not work.

I also tried this code snippet (from here) that was used to load files:

class ReadGcsBlobs(beam.DoFn):
    def process(self, element, *args, **kwargs):
        from apache_beam.io.gcp import gcsio
        gcs = gcsio.GcsIO()
        yield (element, gcs.open(element).read())

model = (p
         | "Initialize" >> beam.Create(["gs://bucket/file.pkl"])
         | "Read blobs" >> beam.ParDo(ReadGcsBlobs())
        )

but that doesn't work for loading my model, or at least I cannot use this model variable to call the predict method, since model here is a PCollection of the file's bytes rather than the unpickled classifier itself.

This should be a pretty straightforward task, but I can't seem to find a straightforward answer to it.

Anton

1 Answer


You can define a ParDo as below:

import logging
import pickle

import apache_beam as beam
from google.cloud import storage

class PredictSklearn(beam.DoFn):
    """Download a pickled scikit-learn model from GCS and run predictions."""

    def __init__(self, project=None, bucket_name=None, model_path=None, destination_name=None):
        self._model = None
        self._project = project
        self._bucket_name = bucket_name
        self._model_path = model_path
        self._destination_name = destination_name

    def download_blob(self, bucket_name=None, source_blob_name=None, project=None, destination_file_name=None):
        """Downloads a blob from the bucket to a local file."""
        storage_client = storage.Client(project=project)
        bucket = storage_client.get_bucket(bucket_name)
        blob = bucket.blob(source_blob_name)
        blob.download_to_filename(destination_file_name)

    # setup() runs once per DoFn instance, so the model is loaded
    # once (or very few times) rather than once per element
    def setup(self):
        logging.info("Model initialization: {}".format(self._model_path))
        self.download_blob(bucket_name=self._bucket_name, source_blob_name=self._model_path,
                           project=self._project, destination_file_name=self._destination_name)
        # unpickle the model
        with open(self._destination_name, 'rb') as fid:
            self._model = pickle.load(fid)

    def process(self, element):
        element["prediction"] = self._model.predict(element["data"])
        return [element]

Then you can invoke this ParDo in your pipeline as below:

    model = (p
         | "Read Files" >> TextIO...
         | "Run Predictions" >> beam.ParDo(PredictSklearn(project=known_args.bucket_project_id,
                                                          bucket_name=known_args.bucket_name,
                                                          model_path=known_args.model_path,
                                                          destination_name=known_args.destination_name))
        )
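
If you would rather not stage the model on the worker's local disk at all, the same idea works by streaming the pickle straight from GCS with Beam's FileSystems API in setup(). This is an untested sketch along the same lines; the class name PredictSklearnFromGcs and the single model_uri argument are just illustrative:

import logging
import pickle

import apache_beam as beam
from apache_beam.io.filesystems import FileSystems

class PredictSklearnFromGcs(beam.DoFn):
    """Loads a pickled scikit-learn model directly from GCS and runs predictions."""

    def __init__(self, model_uri=None):
        # model_uri is assumed to be a full path, e.g. "gs://<bucket>/<model>.pkl"
        self._model_uri = model_uri
        self._model = None

    def setup(self):
        logging.info("Loading model from {}".format(self._model_uri))
        # FileSystems.open returns a readable file-like object for gs:// paths
        with FileSystems.open(self._model_uri) as fid:
            self._model = pickle.load(fid)

    def process(self, element):
        element["prediction"] = self._model.predict(element["data"])
        yield element

This avoids the extra download step and any dependency on local disk space on the workers.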

Jayadeep Jayaraman
  • Hi, I can't get this code to work. First of all, download_blob isn't callable without self.download_blob, and secondly the download_blob function doesn't take that many arguments. I don't really understand the code; where does destination_name get defined? – Anton Nov 14 '19 at 09:38
  • I have updated the answer. I have not executed the specific code to check for syntax but this should give you an idea of how to run models which are pickled and stored in GCS. If you find this useful please do accept the answer. – Jayadeep Jayaraman Nov 14 '19 at 09:48
  • @JayadeepJayaraman I have very similar code to this, but for some reason the worker machine seems to be running out of space and the job gets stuck. I gave the VM 1000 GB and the model is only about 50 MB or so. Any idea why? – Kazuki Nov 23 '20 at 20:32