My current understanding of `TfidfVectorizer` is that it requires `nltk.download("punkt")` to be run before transforming input data, since all of the default tokenizers live in `punkt`. Because I use `TfidfVectorizer` in my Cloud Function, I run `nltk.download("punkt")` inside the Cloud Function container, which downloads `punkt` to `/tmp`. My issue with this is that I can't guarantee access to the same file system contents for each invocation of my Google Cloud Function, because "subsequent calls to the same function will sometimes execute in a different container, so they'll have different `/tmp` mounts. So you can't use `/tmp` to communicate between functions" (from this SO question). As a result, `punkt` has to be re-downloaded whenever the container is switched, and that re-download shows up in my Cloud Function's logs.
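For reference, this is roughly the pattern inside my function (the exact `download_dir` under `/tmp` and the sample documents are placeholders):

```python
import nltk
from sklearn.feature_extraction.text import TfidfVectorizer

# /tmp is the only writable path in the Cloud Function container,
# so punkt has to be downloaded there and NLTK pointed at it.
nltk.download("punkt", download_dir="/tmp/nltk_data")
nltk.data.path.append("/tmp/nltk_data")

docs = ["First example document.", "Second example document."]
vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(docs)
```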
I tried creating a tokenizer deserialized from the `english.pickle` file that is part of `punkt`. Even when passing this custom tokenizer's `tokenize` function as the `tokenizer` argument to `TfidfVectorizer`, the transformation of the input data later on still fails because of the missing `punkt` download.
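Here's a sketch of that attempt (the pickle path and sample documents are placeholders):

```python
import pickle
from sklearn.feature_extraction.text import TfidfVectorizer

# Deserialize the Punkt tokenizer directly from the pickle that
# ships inside the punkt package, trying to bypass nltk.download().
with open("english.pickle", "rb") as f:
    punkt_tokenizer = pickle.load(f)

docs = ["First example document.", "Second example document."]
vectorizer = TfidfVectorizer(tokenizer=punkt_tokenizer.tokenize)
tfidf = vectorizer.fit_transform(docs)  # still fails for me with a missing punkt resource error
```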
Is there any way to download `punkt` into Python's available memory, so that it isn't stored in the file system and wiped when the container is switched? It seems like I need `punkt` downloaded into the file system regardless of whether I pass in a custom tokenizer or let `TfidfVectorizer` choose its own default tokenizer.