I am moving my code from Pandas to PySpark for an NLP task. I have figured out how to apply tokenization (using the Keras built-in Tokenizer) via a pandas UDF. However, I also want to return the fitted tokenizer, so I can reuse it later on the test data.
The problem is that a pandas UDF can only return one-to-one column transformations (a Series, a list of Series, or a scalar), not an arbitrary Python object alongside them. Is there any way to do this?
import numpy as np
import pandas as pd
from pyspark.sql.functions import pandas_udf
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

def tokenize_wrapper(text, maxlen, padding_type):
    # Character-level tokenizer; this is the object I want to get back out.
    tokenizer = Tokenizer(num_words=None, char_level=True, oov_token='UNK')

    # pad_sequences produces integer indices, hence 'array<int>'
    @pandas_udf('array<int>')
    def tokenize(text):
        # Note: this fits the tokenizer separately on each batch the UDF receives,
        # and the fitted state never leaves the executors.
        tokenizer.fit_on_texts(text)
        names = tokenizer.texts_to_sequences(text)
        padded_data = pad_sequences(names, maxlen=maxlen,
                                    padding=padding_type, truncating=padding_type)
        return pd.Series(np.array(padded_data).tolist())

    tokenized_names = tokenize(text)
    return tokenized_names  # only the transformed column, not the tokenizer
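To illustrate the fit-then-transform split I am trying to preserve: the vocabulary must be built once on the training text and then reused unchanged on the test text. Below is a minimal plain-Python stand-in for that pattern (`CharTokenizer` is a hypothetical sketch mimicking the char-level Keras Tokenizer, not a real Keras API), just to make the requirement concrete:

```python
class CharTokenizer:
    """Hypothetical stand-in for a char-level Keras Tokenizer."""

    def __init__(self, oov_token="UNK"):
        self.oov_token = oov_token
        self.char_index = {}

    def fit_on_texts(self, texts):
        # Build the character vocabulary; index 0 is reserved for
        # padding and index 1 for out-of-vocabulary characters.
        for text in texts:
            for ch in text:
                if ch not in self.char_index:
                    self.char_index[ch] = len(self.char_index) + 2

    def texts_to_sequences(self, texts):
        # Unseen characters map to the OOV index 1.
        return [[self.char_index.get(ch, 1) for ch in text] for text in texts]

tok = CharTokenizer()
tok.fit_on_texts(["abc", "abd"])           # fit once, on training data
train_seqs = tok.texts_to_sequences(["abc"])  # [[2, 3, 4]]
test_seqs = tok.texts_to_sequences(["abz"])   # [[2, 3, 1]] — 'z' is OOV
```

This is exactly the state (`char_index` here, the fitted Keras tokenizer in my real code) that I cannot get back out of the pandas UDF.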