TypeError: Can't convert re.compile('[A-Z]+') (re.Pattern) to Union[str, tokenizers.Regex]

Question

I'm having trouble passing a regular expression to the Split() pre-tokenizer in the HuggingFace tokenizers library. The documentation describes the pattern argument of Split() as follows:

pattern (str or Regex) – A pattern used to split the string. Usually a string or a Regex

In my code I am applying the Split() operation like so:

tokenizer.pre_tokenizer = Split(pattern="[A-Z]+", behavior='isolated')

but it's not working because [A-Z]+ is being interpreted as a literal string, not as a regular expression. I also tried compiling the pattern with re, to no avail:

import re

pattern = re.compile("[A-Z]+")
tokenizer.pre_tokenizer = Split(pattern=pattern, behavior='isolated')

Getting the following error:

TypeError: Can't convert re.compile('[A-Z]+') (re.Pattern) to Union[str, tokenizers.Regex]

`"[A-Z]+"` is a string and not a regex expression. `re.compile(...)` is an `re.Pattern` and not `tokenizers.Regex`. Perhaps import tokenizers.Regex from hugging face. — thethiny, Feb 26 '21 at 19:10
I have no idea where you found out about `tokenizers.Regex` because it's not in the docs, but it worked. — Jamie Dimon, Feb 26 '21 at 19:15
From the error itself. `TypeError` is an error of the `Type` of the variable. So it told you that it doesn't match `Union[str, tokenizers.Regex]`. This is called Type Hinting, `Union` means `Or`. So it was expecting a `str` or a `tokenizers.Regex`. That's why I suggested it. — thethiny, Feb 26 '21 at 20:07
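To make the comment above concrete, here is a minimal sketch of the kind of check that produces such an error. The `Regex` class and the `check_pattern` function are hypothetical stand-ins written for illustration, not the tokenizers library's own code; the point is only that an `re.Pattern` is neither a `str` nor a `Regex`, so it is rejected.

```python
import re
from typing import Union


class Regex:
    """Hypothetical stand-in for tokenizers.Regex; it only stores the pattern text."""

    def __init__(self, pattern: str) -> None:
        self.pattern = pattern


def check_pattern(pattern: Union[str, Regex]) -> str:
    """Accept a str or a Regex, and reject everything else with a TypeError."""
    if isinstance(pattern, str):
        return "str"
    if isinstance(pattern, Regex):
        return "Regex"
    raise TypeError(
        f"Can't convert {pattern!r} ({type(pattern).__name__}) to Union[str, Regex]"
    )


print(check_pattern("[A-Z]+"))         # accepted as a plain string
print(check_pattern(Regex("[A-Z]+")))  # accepted as a Regex
try:
    check_pattern(re.compile("[A-Z]+"))
except TypeError as err:
    print(err)                         # re.Pattern matches neither type, so it is rejected
```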

score 2 · Accepted Answer · answered Feb 26 '21 at 19:17

Wrapping the pattern in Regex, imported from the tokenizers library, solved it:

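from tokenizers import Regex
from tokenizers.pre_tokenizers import Split

# Regex(...) marks the pattern as a regular expression rather than a literal string
tokenizer.pre_tokenizer = Split(pattern=Regex("[A-Z]+"),
                                behavior='isolated')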
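For anyone wondering why the plain string did not split anything, the difference between the two interpretations can be illustrated with the standard library alone. This is only a rough emulation using Python's `re` module and `str.split`, not the tokenizers implementation, and the sample text is made up; the capture group is used here to keep each match as its own piece, similar in spirit to behavior='isolated'.

```python
import re

text = "helloWORLDfoo"

# Literal interpretation: search for the exact characters "[A-Z]+",
# which never occur in the text, so nothing is split.
literal_pieces = text.split("[A-Z]+")

# Regex interpretation: each run of uppercase letters is a match.
# The capture group makes re.split keep the matches as separate pieces.
regex_pieces = [piece for piece in re.split(r"([A-Z]+)", text) if piece]

print(literal_pieces)  # ['helloWORLDfoo']
print(regex_pieces)    # ['hello', 'WORLD', 'foo']
```

With the real library, passing `Regex("[A-Z]+")` is what selects the second interpretation.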

Jamie Dimon
