
I am moving my code from Pandas to PySpark for an NLP task. I have figured out how to apply tokenization (using Keras's built-in Tokenizer) via a pandas UDF. However, I also want to return the fitted tokenizer (for later use on the test data).

Pandas UDFs can't return anything other than one-to-one column transformations (a Series, a list of Series, or a scalar). Is there any way to do this?

import numpy as np
import pandas as pd
from pyspark.sql.functions import pandas_udf
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

def tokenize_wrapper(text, maxlen, padding_type):
    # Character-level tokenizer; it is fitted inside the UDF, so the fitted
    # state lives on the executors and never comes back to the driver.
    tokenizer = Tokenizer(num_words=None, char_level=True, oov_token='UNK')

    @pandas_udf('array<decimal>')
    def tokenize(text):
        tokenizer.fit_on_texts(text)
        names = tokenizer.texts_to_sequences(text)
        padded_data = pad_sequences(names, maxlen=maxlen, padding=padding_type, truncating=padding_type)
        data = np.array(padded_data).tolist()
        return pd.Series(data)

    tokenized_names = tokenize(text)
    return tokenized_names
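
This is how I call the wrapper on a Spark DataFrame (the column name, maxlen, and padding value are just examples):

import pyspark.sql.functions as F

# df is a Spark DataFrame with a string column "name" (hypothetical)
df = df.withColumn('tokens', tokenize_wrapper(F.col('name'), 30, 'post'))
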
Abdul Wahab

1 Answer


I personally don't know of a way to distribute the Keras tokenizer so that all processes can access it asynchronously. However, I think you can do something like this with StringIndexer.

from pyspark.ml.feature import StringIndexer
import pyspark.sql.functions as F

df = spark.createDataFrame([("Abc Bbb  cdd",)], ["text"])

# Split each string into characters and explode to one row per character
df = df.withColumn('splitText', F.explode(F.split(F.col('text'), '')))

# Map every character to an integer index, assigned in alphabetical order
indexer = StringIndexer(inputCol="splitText", outputCol="indexed", stringOrderType="alphabetAsc")

# Fit on the characters, transform, then collect the indices back per original string
indexerAgg = (
  indexer
  .fit(df)
  .transform(df)
  .groupBy("text")
  .agg(F.collect_list("splitText").alias("splitText"), F.collect_list("indexed").alias("vector"))
)
indexerAgg.show(truncate=False)
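
If you also need to apply the fitted mapping to test data later (the part about returning the fitted tokenizer), one option is to keep a handle on the fitted StringIndexerModel and persist it. A rough sketch, with a hypothetical save path:

# Keep the fitted model instead of chaining fit().transform()
indexerModel = indexer.fit(df)
indexerModel.save("/tmp/char_indexer")  # hypothetical path

# Later, e.g. when scoring test data
from pyspark.ml.feature import StringIndexerModel

reloaded = StringIndexerModel.load("/tmp/char_indexer")
testDf = spark.createDataFrame([("Bcd aaa",)], ["text"])
testDf = testDf.withColumn("splitText", F.explode(F.split(F.col("text"), "")))
testIndexed = reloaded.transform(testDf)  # reuses the mapping learned on the training data
# Note: characters unseen at fit time raise an error unless handleInvalid is set to "keep" or "skip" on the indexer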

And if you need a better tokenizer than a simple character split, RegexTokenizer may help.
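
A minimal sketch of that alternative (word-level tokens rather than characters; the pattern and column names are just examples):

from pyspark.ml.feature import RegexTokenizer

rawDf = spark.createDataFrame([("Abc Bbb  cdd",)], ["text"])

# Split on runs of non-word characters; RegexTokenizer lowercases by default
regexTokenizer = RegexTokenizer(inputCol="text", outputCol="words", pattern="\\W+")
regexTokenizer.transform(rawDf).select("text", "words").show(truncate=False)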

Took some of this from here

Ken Myers