
I have a spaCy model and I am trying to save it to a GCS bucket using this format:

trainer.to_disk('gs://{bucket-name}/model')

But each time I run this I get this error message:

FileNotFoundError: [Errno 2] No such file or directory: 'gs:/{bucket-name}/model'

Also, when I create a Kubeflow persistent volume and save the model there so I can load it back with trainer.load('model'), I get this error message:

File "/usr/local/lib/python3.7/site-packages/spacy/__init__.py", line 30, in load
    return util.load_model(name, **overrides)
  File "/usr/local/lib/python3.7/site-packages/spacy/util.py", line 175, in load_model
    raise IOError(Errors.E050.format(name=name))
OSError: [E050] Can't find model '/model/'. It doesn't seem to be a shortcut link, a Python package or a valid path to a data directory.

I don't understand why I am getting these errors, as this works perfectly when I run this on my PC locally and use a local path.

wyn

3 Answers


Cloud Storage is not a local disk or a physical storage unit that you can save things to directly.

As you say:

this works perfectly when I run this on my PC locally and use a local path

Cloud Storage, however, is not a local path for any other tool running in the cloud.

If you are using Python, you will have to create a client with the Cloud Storage library and then upload your file using upload_blob, e.g.:

from google.cloud import storage


def upload_blob(bucket_name, source_file_name, destination_blob_name):
    """Uploads a file to the bucket."""
    # bucket_name = "your-bucket-name"
    # source_file_name = "local/path/to/file"
    # destination_blob_name = "storage-object-name"

    storage_client = storage.Client()
    bucket = storage_client.bucket(bucket_name)
    blob = bucket.blob(destination_blob_name)

    blob.upload_from_filename(source_file_name)
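
Putting the two together, a minimal sketch (not part of the original snippet; the bucket name is a placeholder) would be to write the spaCy model to a temporary local directory with to_disk and then push every file in that directory through upload_blob, since a spaCy model is a directory containing several files:

import os
import tempfile

# Minimal sketch: to_disk() writes a local directory of files, so save the model
# to a temporary folder first and then upload each file to the bucket.
with tempfile.TemporaryDirectory() as tmp_dir:
    local_model_dir = os.path.join(tmp_dir, "model")
    trainer.to_disk(local_model_dir)

    for root, _, files in os.walk(local_model_dir):
        for file_name in files:
            local_path = os.path.join(root, file_name)
            # Preserve the "model/..." structure as the object name in GCS.
            blob_name = os.path.relpath(local_path, tmp_dir).replace(os.sep, "/")
            upload_blob("your-bucket-name", local_path, blob_name)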
Chris32
  • I understand this, but how do I save this model to GCS using this format `trainer.to_disk()`? – wyn Dec 01 '20 at 07:59
  • Well, since spaCy only has the to_disk method to store the model, you can either save the model locally to a temp folder and then run another process with code similar to the one that I posted (or run it all in the same code execution), or try [GCS Fuse](https://cloud.google.com/storage/docs/gcs-fuse) to mount your Storage bucket as a file system folder; this will sync the content of your output object to GCS, so you won't have to modify your code, just your environment. – Chris32 Dec 01 '20 at 08:21
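
For the GCS Fuse option mentioned in the comment above, a rough sketch (assuming the gcsfuse CLI is installed in the environment; the bucket name and mount point are placeholders) could look like this, after which the original to_disk call works unchanged:

import os
import subprocess

# Rough sketch: mount the bucket as a local directory with gcsfuse,
# then write to it as if it were a normal filesystem path.
os.makedirs("/mnt/model-bucket", exist_ok=True)
subprocess.run(["gcsfuse", "your-bucket-name", "/mnt/model-bucket"], check=True)

trainer.to_disk("/mnt/model-bucket/model")  # now behaves like a local directory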

Since you've tagged this question "kubeflow-pipelines", I'll answer from that perspective.

KFP strives to be platform-agnostic, and most good components are cloud-independent. KFP promotes system-managed artifact passing, where the component code only writes output data to local files and the system picks it up and makes it available to other components.

So it's best to write your spaCy model trainer that way too: have it write its data to local files. Check how other components work, for example Train Keras classifier.

Since you want to upload to GCS, do that explicitly by passing the model output of your trainer to an "Upload to GCS" component:

from kfp import components

upload_to_gcs_op = components.load_component_from_url('https://raw.githubusercontent.com/kubeflow/pipelines/616542ac0f789914f4eb53438da713dd3004fba4/components/google-cloud/storage/upload_to_explicit_uri/component.yaml')

def my_pipeline():
    model = train_spacy_model(...).outputs['model']

    upload_to_gcs_op(
        data=model,
        gcs_path='gs://.....',
    )
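
The trainer component itself is not shown here; a minimal sketch of what it could look like (my assumption, using the KFP v1 create_component_from_func API and a blank pipeline as a stand-in for real training) is:

from kfp.components import create_component_from_func, OutputPath

def train_spacy_model(model: OutputPath()):
    """Trains a spaCy pipeline and writes it only to a local output path."""
    import spacy
    nlp = spacy.blank("en")  # placeholder for real training code
    nlp.to_disk(model)       # KFP picks up this local directory as the 'model' output

train_spacy_model = create_component_from_func(
    train_spacy_model,
    base_image="python:3.7",
    packages_to_install=["spacy"],
)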
Ark-kun

The following implementation assumes you have gsutil installed on your computer. The spaCy version used was 3.2.4. In my case, I wanted everything to be part of a single (demo) Python file, spacy_import_export.py. To do so, I had to use the subprocess Python library, plus this comment, as follows:

# spacy_import_export.py

import spacy
import subprocess  # Will be used later

# spaCy models trained by the user are always stored as LOCAL directories, with more subdirectories and files in them.
PATH_TO_MODEL = "/home/jupyter/"  # Use your own path!

# Test-loading your "trainer" (optional step)
trainer = spacy.load(PATH_TO_MODEL+"model")

# Replace 'destination-bucket-name' with your own bucket name:
bucket_name = "destination-bucket-name"
GCS_BUCKET = "gs://{}/model".format(bucket_name)

# This does the trick for the UPLOAD to Cloud Storage:
# TIP: Just to be safe, check Cloud Storage afterwards: "model" should now be in GCS_BUCKET
subprocess.run(["gsutil", "-m", "cp", "-r", PATH_TO_MODEL+"model", GCS_BUCKET])

# This does the trick for the DOWNLOAD:
# HINT: By now, in PATH_TO_MODEL, you should have a "model" & a "downloaded_model"
subprocess.run(["mkdir", "-p", PATH_TO_MODEL+"downloaded_model"])  # destination folder must exist
subprocess.run(["gsutil", "-m", "cp", "-r", GCS_BUCKET+"/*", PATH_TO_MODEL+"downloaded_model"])

# Test-loading your "GCS downloaded model" (optional step)
nlp_original = spacy.load(PATH_TO_MODEL+"downloaded_model")

I apologize for the excess of comments; I just wanted to make everything clear for "spaCy newcomers". I know it is a bit late, but I hope it helps.

David Espinosa