My current understanding of `TfidfVectorizer` is that it requires `nltk.download("punkt")` to be run before transforming input data, since all of the default tokenizers live in `punkt`. Because I use `TfidfVectorizer` in my Cloud Function, I run `nltk.download("punkt")` inside the Cloud Function container, which downloads `punkt` to `/tmp`. My issue with this is that I can't guarantee access to the same file system contents for each invocation of my Google Cloud Function, because "subsequent calls to the same function will sometimes execute in a different container, so they'll have different `/tmp` mounts. So you can't use `/tmp` to communicate between functions" (from this SO question). As a result, `punkt` has to be re-downloaded whenever the container is switched, and that re-download shows up in my Cloud Function's logs.
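For reference, this is roughly the pattern inside my function (the exact `download_dir` under `/tmp` and the sample documents are placeholders):

```python
import nltk
from sklearn.feature_extraction.text import TfidfVectorizer

# /tmp is the only writable path in the Cloud Function container,
# so punkt has to be downloaded there and NLTK pointed at it.
nltk.download("punkt", download_dir="/tmp/nltk_data")
nltk.data.path.append("/tmp/nltk_data")

docs = ["First example document.", "Second example document."]
vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(docs)
```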
I tried creating a tokenizer deserialized from the `english.pickle` file that is part of `punkt`. Even when passing this custom tokenizer's `tokenize` function as the `tokenizer` argument to `TfidfVectorizer`, the transformation of the input data later on still fails because of the missing `punkt` download.
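Here's a sketch of that attempt (the pickle path and sample documents are placeholders):

```python
import pickle
from sklearn.feature_extraction.text import TfidfVectorizer

# Deserialize the Punkt tokenizer directly from the pickle that
# ships inside the punkt package, trying to bypass nltk.download().
with open("english.pickle", "rb") as f:
    punkt_tokenizer = pickle.load(f)

docs = ["First example document.", "Second example document."]
vectorizer = TfidfVectorizer(tokenizer=punkt_tokenizer.tokenize)
tfidf = vectorizer.fit_transform(docs)  # still fails for me with a missing punkt resource error
```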
Is there any way to download `punkt` into Python's available memory, so that it isn't stored in the file system and wiped when the container is switched? It seems like I need `punkt` downloaded into the file system regardless of whether I pass in a custom tokenizer or let `TfidfVectorizer` choose its own default tokenizer.