Can Python's regex module match patterns over non-UTF encodings?

Question

I want to use this encoding (TACE16) for Tamil text because it is more consistent with the language's nature, whereas the Unicode encoding severely damages (read more here) the intrinsic features of the fusion of alphabets.

I want to run regular expressions over text in this encoding. Is that possible with Python's regex module, or would I have to write my own FSM for this?

Python's `re` module can work with regular (Unicode) strings (type `str`) and byte strings (type `bytes`). The UTFs define a mapping between `str` and `bytes`, but the `re` functions have nothing to do with this. I don't fully understand what exactly TACE16 tries to replace, but if it's just about replacing UTF-8/16/32, then the answer is "doesn't apply" (the `re` module doesn't care). If TACE16 is about representing the Tamil letters in a way incompatible with Unicode, i.e. you need a different string type (other than `str`/`bytes`), then the answer is "no". — lenz, May 30 '20 at 08:22
So if I can map UTF strings into a series of bytes based on TACE16, the `re` module will work seamlessly? I'd also have to map the punctuation characters and regex special characters to the same locations, because only Tamil characters are defined in TACE16. Also, should I rely on UTF-16 encoding to work with TACE16, or can I mix and match UTF-8 and TACE16 without hassle? — vanangamudi, May 30 '20 at 15:21
`re` will work with byte strings, but you will be writing a lot of hard-to-read escape sequences. For your other questions, I'm not sure. Also note that the linked repo is marked as WIP and hasn't seen any maintenance for three years. — lenz, May 30 '20 at 15:47
Can you explain a little more about this? "you will be writing a lot of hard-to-read escape sequences" — vanangamudi, May 31 '20 at 13:47
If I understand correctly, TACE16 uses codepoints in the private-use area of Unicode. So you can just use regular strings, no need to deal with byte-string regexes. — lenz, May 31 '20 at 18:51
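
To illustrate the last two comments, here is a minimal sketch that matches the same text once as a regular `str` and once as a byte string. It assumes, as the last comment suggests, that the Tamil letters live in the Unicode private-use area. The code points `KA` and `KI` are hypothetical placeholders, not real TACE16 assignments, and UTF-16-BE is picked only as an example byte serialization.

```python
import re

# Hypothetical placeholders: these private-use code points are NOT the real
# TACE16 assignments; substitute the values from the TACE16 table.
KA = "\ue210"
KI = "\ue212"

# Private-use area of the Basic Multilingual Plane (U+E000..U+F8FF).
PUA_RANGE = "\ue000-\uf8ff"

text = f"abc {KA}{KI}{KA} xyz, {KI}"

# str pattern: one or more consecutive private-use characters.
# Punctuation and regex metacharacters stay ordinary Unicode characters.
pua_run = re.compile(f"[{PUA_RANGE}]+")
str_matches = pua_run.findall(text)

# bytes pattern over the same text serialized as UTF-16-BE.
# Every letter has to be spelled out as a two-byte escape sequence, which is
# the "hard-to-read escape sequences" problem. A bytes pattern also knows
# nothing about 16-bit alignment, so it could match across code-unit
# boundaries in less friendly data.
data = text.encode("utf-16-be")
bytes_run = re.compile(rb"(?:\xe2\x10|\xe2\x12)+")
bytes_matches = bytes_run.findall(data)

# Both approaches find the same two runs of letters.
decoded = [m.decode("utf-16-be") for m in bytes_matches]
print(str_matches == decoded)
```

With the `str` approach the encoding only matters when reading or writing files; the pattern itself deals with code points and never sees bytes.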

