
I'm having trouble passing a regular expression to the Split() pre-tokenizer in the HuggingFace tokenizers library. The documentation describes the pattern argument of Split() as follows:

pattern (str or Regex) – A pattern used to split the string. Usually a string or a Regex

In my code I am applying the Split() operation like so:

tokenizer.pre_tokenizer = Split(pattern="[A-Z]+", behavior='isolated')

but this does not work, because "[A-Z]+" is interpreted as a literal string rather than a regular expression. I also tried passing a compiled pattern, to no avail:

pattern = re.compile("[A-Z]+")
tokenizer.pre_tokenizer = Split(pattern=pattern, behavior='isolated')

This raises the following error:

TypeError: Can't convert re.compile('[A-Z]+') (re.Pattern) to Union[str, tokenizers.Regex]

Jamie Dimon
  • `"[A-Z]+"` is a string, not a regular expression. `re.compile(...)` returns an `re.Pattern`, not a `tokenizers.Regex`. Perhaps import `Regex` from the Hugging Face `tokenizers` package. – thethiny Feb 26 '21 at 19:10
  • I have no idea where you found out about `tokenizers.Regex` because it's not in the docs, but it worked. – Jamie Dimon Feb 26 '21 at 19:15
  • From the error itself. `TypeError` is an error of the `Type` of the variable. So it told you that it doesn't match `Union[str, tokenizers.Regex]`. This is called Type Hinting, `Union` means `Or`. So it was expecting a `str` or a `tokenizers.Regex`. That's why I suggested it. – thethiny Feb 26 '21 at 20:07

1 Answer


The solution was to import Regex from the tokenizers library and wrap the pattern in it:

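from tokenizers import Regex
from tokenizers.pre_tokenizers import Split

# Regex(...) marks the pattern as a regular expression instead of a literal string
tokenizer.pre_tokenizer = Split(pattern=Regex("[A-Z]+"),
                                behavior='isolated')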
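
To see the difference without building a full tokenizer, here is a minimal sketch that calls the pre-tokenizer directly. The sample text and variable names are my own illustration, not part of the original code, and it assumes the `tokenizers` package is installed. It also shows one way to reuse an existing `re.Pattern` by passing its source string to `Regex`. As far as I know, `tokenizers.Regex` is not backed by Python's `re` engine, so flags given to `re.compile` are not carried over and more exotic syntax may behave differently.

```python
import re

from tokenizers import Regex
from tokenizers.pre_tokenizers import Split

text = "helloWORLDfoo"  # hypothetical sample input

# A plain str pattern is matched literally, so the characters "[A-Z]+"
# would have to appear verbatim in the text for a split to happen.
literal_split = Split(pattern="[A-Z]+", behavior="isolated")
literal_pieces = [piece for piece, _ in literal_split.pre_tokenize_str(text)]

# Wrapping the pattern in tokenizers.Regex makes it a real regular expression.
regex_split = Split(pattern=Regex("[A-Z]+"), behavior="isolated")
regex_pieces = [piece for piece, _ in regex_split.pre_tokenize_str(text)]

# An already compiled re.Pattern can be reused through its source string.
# Assumption: the pattern uses syntax that both engines understand.
compiled = re.compile("[A-Z]+")
converted_split = Split(pattern=Regex(compiled.pattern), behavior="isolated")
converted_pieces = [piece for piece, _ in converted_split.pre_tokenize_str(text)]

print(literal_pieces)    # the text comes back unsplit
print(regex_pieces)      # the uppercase run is isolated as its own piece
print(converted_pieces)  # same result as regex_pieces
```

With `behavior='isolated'`, each match is kept as a separate piece rather than being dropped or merged into a neighbour.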
Jamie Dimon