I have a regex tokenizer:
import re

HTML_SCANNER_REGEX = re.compile(r'</?\w+|\w+[#\+]*|:|\.|\?')

def html_regex_tokenizer(corpus):
    return [match.group() for match in re.finditer(HTML_SCANNER_REGEX, corpus)]
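On an ordinary Python string it does exactly what I want:

print(html_regex_tokenizer('<b>Job Components:</b>'))
# ['<b', 'Job', 'Components', ':', '</b']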
I would like to use this to do some basic text classification in TensorFlow:
from tensorflow.keras import layers

max_features = 10_000
sequence_length = 250

vectorize_layer = layers.TextVectorization(
    split=html_regex_tokenizer,
    max_tokens=max_features,
    output_mode='int',
    output_sequence_length=sequence_length)
# Make a text-only dataset (without labels), then call adapt
train_text = raw_train_ds.map(lambda x, y: x)
vectorize_layer.adapt(train_text)
Inside the split function I can see that the corpus argument is a tensorflow.python.framework.ops.Tensor("StaticRegexReplace:0", shape=(None,), dtype=string), but I can't figure out how to run re.finditer against whatever strings are in there.
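Calling the tokenizer directly on a string tensor shows the same problem in miniature (this is just my own minimal repro, using one of the samples below):

import tensorflow as tf

sample = tf.constant('<b>Job Components:</b>\r\n')
html_regex_tokenizer(sample)
# TypeError: expected string or bytes-like object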
The data is strings of text, each optionally wrapped in a single surrounding HTML tag. Running this:
for text_batch, label_batch in raw_train_ds.take(1):
    for i in range(3):
        print('HTML', text_batch.numpy()[i])
        print('Label', label_batch.numpy()[i])
gets this:
HTML b'<b>Job Components:</b>\r\n'
Label 1
HTML b'Proficient in SAS Macro Programming\r\n'
Label 0
HTML b'<div>You\xe2\x80\x99ll investigate trends in data, applying models, algorithms, and statistical tests to provide recommendations and change business processes. You\xe2\x80\x99ll gain exposure across our business by providing consultation on priority analytic projects, as well as guiding more junior team members.</div>\r\n'
Label 0
I've tried adapting the gen_string_ops regex code for my own use, but all that's available are match and replace functions (regex_full_match, regex_full_match_eager_fallback, static_regex_replace, etc.). How do I run re.finditer over the array of strings inside the function I pass to layers.TextVectorization(split=...)?
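For what it's worth, the closest I've managed with those built-in ops is something like this rough sketch, which only checks whether a match exists or rewrites it, and never returns the individual tokens:

import tensorflow as tf

texts = tf.constant(['<b>Job Components:</b>\r\n'])

# Whole-string match: gives a boolean per string, not the matched tokens.
print(tf.strings.regex_full_match(texts, r'.*</?\w+.*'))

# Replace: rewrites the matches in place, but still doesn't expose them.
print(tf.strings.regex_replace(texts, r'</?\w+', ' '))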