
I have a regex tokenizer:

import re

HTML_SCANNER_REGEX = re.compile(r'</?\w+|\w+[#\+]*|:|\.|\?')
def html_regex_tokenizer(corpus):
    return [match.group() for match in HTML_SCANNER_REGEX.finditer(corpus)]
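For example (repeating the definitions so the snippet runs standalone), on the first sample shown below it produces:

```python
import re

HTML_SCANNER_REGEX = re.compile(r'</?\w+|\w+[#\+]*|:|\.|\?')

def html_regex_tokenizer(corpus):
    return [match.group() for match in HTML_SCANNER_REGEX.finditer(corpus)]

# Tag brackets split off: '<b' and '</b' become tokens, '>' is dropped.
print(html_regex_tokenizer('<b>Job Components:</b>\r\n'))
# ['<b', 'Job', 'Components', ':', '</b']
```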

I would like to use this to do some basic text classification in TensorFlow:


from tensorflow.keras import layers

max_features = 10_000
sequence_length = 250

vectorize_layer = layers.TextVectorization(
    split=html_regex_tokenizer,
    max_tokens=max_features,
    output_mode='int',
    output_sequence_length=sequence_length)

# Make a text-only dataset (without labels), then call adapt
train_text = raw_train_ds.map(lambda x, y: x)
vectorize_layer.adapt(train_text)

I can see that the corpus argument arrives as a tensorflow.python.framework.ops.Tensor("StaticRegexReplace:0", shape=(None,), dtype=string) object, but I can't figure out how to run re.finditer against the strings inside it.

The data is strings of text with at most one surrounding HTML tag. Running this:

for text_batch, label_batch in raw_train_ds.take(1):
  for i in range(3):
    print('HTML', text_batch.numpy()[i])
    print('Label', label_batch.numpy()[i])

gets this:

HTML b'<b>Job Components:</b>\r\n'
Label 1
HTML b'Proficient in SAS Macro Programming\r\n'
Label 0
HTML b'<div>You\xe2\x80\x99ll investigate trends in data, applying models, algorithms, and statistical tests to provide recommendations and change business processes. You\xe2\x80\x99ll gain exposure across our business by providing consultation on priority analytic projects, as well as guiding more junior team members.</div>\r\n'
Label 0

I've tried to adapt the gen_string_ops regex code for my own use, but it only provides match and replace functions (regex_full_match, regex_full_match_eager_fallback, static_regex_replace, etc.). How do I run re.finditer on the array of strings passed to the split= argument of layers.TextVectorization?

Dave Babbitt

1 Answer


You should probably use tf operations like tf.strings.regex_replace and tf.strings.split. If you really want to use the re library, which does not operate on tensors, you will have to wrap the logic in html_regex_tokenizer with tf.py_function so it runs eagerly. Even without knowing exactly what your data looks like, IMO you are better off using tf operations.
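If you do go the tf.py_function route, a minimal sketch might look like this (py_split and tokenize_batch are names I made up; TextVectorization only requires that the split callable map a string tensor to a RaggedTensor of tokens):

```python
import re
import tensorflow as tf

HTML_SCANNER_REGEX = re.compile(r'</?\w+|\w+[#\+]*|:|\.|\?')

def py_split(input_batch):
    def tokenize_batch(batch):
        # Inside tf.py_function the argument is an eager tensor, so
        # .numpy() yields byte strings we can decode and scan with re.
        flat, row_lengths = [], []
        for raw in batch.numpy():
            tokens = [m.group()
                      for m in HTML_SCANNER_REGEX.finditer(raw.decode('utf-8'))]
            flat.extend(tokens)
            row_lengths.append(len(tokens))
        return flat, row_lengths

    flat, row_lengths = tf.py_function(
        tokenize_batch, [input_batch], [tf.string, tf.int64])
    # py_function outputs have unknown shapes; pin them to rank 1 so
    # the RaggedTensor can be built in graph mode as well.
    flat.set_shape([None])
    row_lengths.set_shape([None])
    return tf.RaggedTensor.from_row_lengths(flat, row_lengths)
```

You could then pass split=py_split to layers.TextVectorization in place of the raw Python function.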

Maybe something like this:

tf.strings.regex_replace('<b>Job Components:</b>\r\n', r'<[^>]*>[#\+]*|:|\.|[\t\n\r]+|[^\w\s\\]+', '')
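Building on that idea, a pure-graph split callable using only tf.strings ops could look like the sketch below (the tag-stripping pattern `<[^>]*>` is an assumption about your data, and note that tf.strings.regex_replace uses RE2 syntax, not Python's re):

```python
import tensorflow as tf

def tf_split(input_batch):
    # Strip anything that looks like an HTML tag, replace remaining
    # punctuation with spaces, then whitespace-split into a RaggedTensor.
    no_tags = tf.strings.regex_replace(input_batch, r'<[^>]*>', ' ')
    cleaned = tf.strings.regex_replace(no_tags, r'[^\w\s]+', ' ')
    return tf.strings.split(cleaned)
```

Unlike the py_function approach, this drops the tags entirely instead of keeping '<b'-style tokens, but it runs fully inside the graph and can be passed directly as split=tf_split.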
AloneTogether