
So far, I have been pre-processing my text data using NumPy and built-in functions (such as the Keras tokenizer class, tf.keras.preprocessing.text.Tokenizer: https://keras.io/api/preprocessing/text/).

And this is where I got stuck: since I am trying to scale up my model and data set, I am experimenting with Spark and Spark NLP (https://nlp.johnsnowlabs.com/docs/en/annotators#tokenizer)... however, I have not yet found a tokenizer that works the same way. The fitted tokenizer must later be available to transform validation/new data.

My output should represent each token as a unique integer value (starting from 1), something like:

[ 10,... ,  64,  555]
[ 1,... , 264,   39]
[ 12,..., 1158, 1770]
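
For reference, this is roughly what I do with Keras today (a minimal sketch; the example sentences are made up):

from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

texts = ["okay reason still not get background",
         "picture expand fill whole excited"]

tokenizer = Tokenizer()                       # builds the word -> integer mapping
tokenizer.fit_on_texts(texts)                 # fit on training data only
sequences = tokenizer.texts_to_sequences(texts)    # reusable later for new data
padded = pad_sequences(sequences, padding="post")  # zero-pad to equal length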

So far, I have been able to use the Spark NLP tokenizer to obtain tokenized words:

[okay,..., reason, still, not, get, background] 
[picture,..., expand, fill, whole, excited]                     
[not, worry,..., happy, well, depend, on, situation]
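
(I obtained these with a Spark NLP pipeline along the lines of the following sketch; the column names are illustrative:)

from pyspark.ml import Pipeline
from sparknlp.base import DocumentAssembler, Finisher
from sparknlp.annotator import Tokenizer as NLPTokenizer

document = DocumentAssembler().setInputCol("text").setOutputCol("document")
tokenizer = NLPTokenizer().setInputCols(["document"]).setOutputCol("token")
finisher = Finisher().setInputCols(["token"]).setOutputCols(["tokens"])  # plain string arrays

pipeline = Pipeline(stages=[document, tokenizer, finisher])
tokens_df = pipeline.fit(df).transform(df)  # df is a DataFrame with a "text" column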

Does anyone have a solution that doesn't require copying the data out of the Spark environment?

UPDATE:

I created two CSVs to clarify my current issue. The first file was created through a pre-processing pipeline: 1. cleaned_delim_text

After that, the delimited words should be "translated" to integer values, and the sequences should be padded with zeros to the same length: 2. cleaned_tok_text
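
In plain Python terms, the transformation I am after looks like this (the vocabulary dict here is made up; in practice it would be learned from the training data):

vocab = {"okay": 10, "reason": 64, "still": 555, "not": 1}  # hypothetical mapping

rows = [["okay", "reason", "still"],
        ["not", "reason"]]
max_len = 3
encoded = [[vocab[w] for w in row] + [0] * (max_len - len(row)) for row in rows]
# -> [[10, 64, 555], [1, 64, 0]]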


1 Answer


Please try the combination below:

1. Use Tokenizer to convert the statements into words, and then

2. Use Word2Vec to compute distributed vector representations of those words.
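
A minimal PySpark sketch of that combination (column names and parameters are illustrative):

from pyspark.ml import Pipeline
from pyspark.ml.feature import Tokenizer, Word2Vec

df = spark.createDataFrame(
    [("okay reason still not get background",),
     ("picture expand fill whole excited",)],
    ["text"],
)

tokenizer = Tokenizer(inputCol="text", outputCol="words")    # 1. statements -> words
word2vec = Word2Vec(inputCol="words", outputCol="features",  # 2. words -> vectors
                    vectorSize=50, minCount=1)

model = Pipeline(stages=[tokenizer, word2vec]).fit(df)
model.transform(df).select("words", "features").show(truncate=False)

The fitted PipelineModel can then be reused on validation/new data via model.transform(new_df).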

  • Hi, thanks for your answer. I might be misunderstanding it, but W2V is an embedding method, right? I rather just want a "simple" word-to-integer translation as in my example, not a vector representation, so that I end up with a dict, similar to the Keras tokenizer. Does this make sense? – Bennimi Jun 19 '20 at 15:39
  • Can you provide sample input and output? – Som Jun 19 '20 at 16:08