I am feeding my Discord server messages into an RNN so that I can create a chatbot based on those messages. I know TensorFlow's tf.keras.preprocessing.text.Tokenizer
can tokenize on a character level, but I want to include special tokens, since I want the bot to simulate a person writing multiple messages on Discord, pressing enter after each phrase. An example sentence, with special tokens, would be:
'<START> im a riot <ENTER> ok <ENTER> lets see here <END> '
How can I include special tokens like this? So far the only way I've found is to use re.findall
to separate characters and special tokens (re.findall(r'(?:(?:<[\w]+?>)|(?:[\w.,?!:]))', text)); however, it is slow, and I would prefer some sort of TensorFlow method so it is portable and can use graph execution on tf.data Datasets.
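For context, here is a minimal sketch of the TF-native direction I was considering: use tf.strings.regex_replace to tag every special token or single character with a delimiter, then tf.strings.split on that delimiter. The SEP character and the char_tokenize name are my own choices, and it assumes the delimiter never occurs in the chat text and that literal newlines are already represented by <ENTER>:

```python
import tensorflow as tf

SEP = '\x01'  # delimiter assumed never to appear in the chat text

def char_tokenize(text):
    """Split a string tensor into single characters plus whole <...> tokens."""
    # Append the delimiter after every special token or single character.
    # RE2 tries the alternatives left to right, so <START> etc. are kept whole.
    tagged = tf.strings.regex_replace(text, r'(<[A-Za-z]+>|.)', r'\1' + SEP)
    # Drop the trailing delimiter so split() does not emit an empty last token.
    tagged = tf.strings.regex_replace(tagged, SEP + r'$', '')
    return tf.strings.split(tagged, SEP)
```

Since both ops are graph-compatible, this should be usable inside dataset.map; on a batched string tensor, tf.strings.split returns a RaggedTensor of tokens per message.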