Clean method for storing regex strings in Python

Question

I want to store a series of pre-tested regexes in a config file, and read and apply them at runtime.

However, because they're commonly packed with escape characters, by the time I've loaded them up into memory, and populated them into a dict, they've been escaped to death.

How can I preserve the integrity of my regex definitions, so that they will re.compile?

Alternately, given that many of the post-escape strings end up in a form with \x00 characters, how do I convert these back into a form that will be consumed correctly by re.compile?

For example, I have the regex "\btest\b" written in a file. If I want to put this into a re.compile, I can force it to do so with re.compile(r"\btest\b"). However, I don't want to write this code by hand; I want to lift it from a file and process it as a variable (I've got thousands of these to deal with).

There doesn't seem to be a way to apply the r prefix to a string variable, and so I'm left trying to compile with '\x08test\x08', which doesn't do what I want it to.

This must be a fairly regular issue - how do others deal with this problem?

If you write `\btest\b` as literal text in a file and then read it in, it will be equal to `r'\btest\b'` — Wiktor Stribiżew, Oct 25 '18 at 14:04
How about just opening the file, iterating through the lines, and passing every line to re.compile? Could you post an example of what your file looks like? — Sharku, Oct 25 '18 at 14:10
Being "escaped to death" is just a human problem though. The program would still be reading the files just fine as intended. e.g. `\btest\b` would be read as `\\btest\\b` and would `re.compile()` just fine. — r.ook, Oct 25 '18 at 14:12
That's not what I'm finding. Perhaps that's because I'm reading the file via a csv library, json, or pandas. I don't intend to write the io-level code for what should be a simple config parser. It seems as though there are numerous ways that Python starts with `"\btest\b"` (which re.compile accepts), but once it ends up in the `"\x08test\x08"` form (which re.compile does not interpret the same way) there seems to be no simple way to perform the reverse operation. — Thomas Kimber, Oct 25 '18 at 14:13
@Idlehands if I could force the literal to be "escaped" into `"\\btest\\b"` then yes, that would work I guess - unfortunately, that doesn't seem to be what's happening. — Thomas Kimber, Oct 25 '18 at 14:16
@ThomasKimber It *should* be though. How are you reading the file? When I tested a file stored with `\btest\b`, `open(file, 'r').read()` returns exactly that, `\\btest\\b`. — r.ook, Oct 25 '18 at 14:17
OK, so the complication is that the file contains rules that follow a series of different format grammars - to read the grammar, I'm using regex - and one of the allowable rule syntaxes is to accept a regex string. So I'm pulling each rule apart using regex, and one output from that might be a regex string which itself needs to be applied post `re.compile`. But, it could be I've been misdirected, since I've just been testing this operation in the interpreter, which I guess is doing more than would be happening if I lifted the content direct from a file. — Thomas Kimber, Oct 25 '18 at 14:23
It's a bit hard to follow solely based on your comment. If the answer below didn't solve your problem, I'd recommend you provide a [MCVE] to reproduce the problem so we can understand a bit better the nature of the issue. — r.ook, Oct 25 '18 at 14:34
This was a case of [PEBKAC](https://en.wiktionary.org/wiki/PEBCAK). I was testing functionality that would (as you guys correctly spotted) expect to take a value from a file, by typing in a similar representation via the interpreter. Since the interpreter "interprets", I managed to mislead myself into thinking this was about raw vs escaped strings. — Thomas Kimber, Oct 25 '18 at 16:43

score 4 · Accepted Answer · answered Oct 25 '18 at 14:12

Like the comment says, there is no need to do anything special.
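To illustrate the point, here is a minimal sketch. The file name `pattern.txt` and the temporary directory are assumptions made only for this demo. It writes the eight characters `\btest\b` to a file, reads them back, and compares the result with the raw literal. The confusion in the question comes from typing `"\btest\b"` without the `r` prefix in source code or at the interpreter, where Python itself turns `\b` into a backspace before `re` ever sees the string.

```python
import os
import re
import tempfile

# Hypothetical file name, used only for this demonstration.
path = os.path.join(tempfile.mkdtemp(), "pattern.txt")

# Write the literal characters \btest\b to disk.
with open(path, "w", encoding="utf8") as f:
    f.write(r"\btest\b")

with open(path, encoding="utf8") as f:
    from_file = f.read()

# Text read from a file is not processed for Python escape sequences,
# so it is identical to the raw string literal.
same_as_raw = from_file == r"\btest\b"

# A non-raw literal in source code IS processed by Python:
# each "\b" becomes a single backspace character (\x08).
typed_literal = "\btest\b"

# The pattern from the file matches the word; the typed literal does not,
# because it looks for literal backspace characters around "test".
matches_from_file = re.compile(from_file).search("a test here") is not None
matches_typed = re.compile(typed_literal).search("a test here") is not None
```

In other words, raw strings only affect how Python parses literals in source code. A string that arrives from a file already holds the backslashes that `re.compile` expects.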

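Imagine a UTF-8 encoded text file named regexps.txt with one regex on each line. Creating a list of compiled regexps from that file would then be something like the following (the trailing newline of each line is stripped, otherwise it would become part of the pattern):

import re
with open('regexps.txt', encoding='utf8') as f:
    # rstrip('\n') keeps the line's newline out of the pattern
    compiled_regexps = [re.compile(line.rstrip('\n')) for line in f]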
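The comments mention reading the config through `json`, and that format adds one rule of its own: JSON string syntax has its own escape sequences, so a backslash must be doubled in the file, and a lone `\b` inside a JSON string is decoded as a backspace. The sketch below assumes a hypothetical config with two made-up keys, `word` and `digits`, and is only meant to show the round trip, not a recommended config layout.

```python
import json
import re

# Hypothetical config content. On disk, each backslash is doubled,
# as JSON requires. The r prefix keeps Python from touching them here.
config_text = r'{"word": "\\btest\\b", "digits": "\\d+"}'

# json.loads turns each doubled backslash back into a single one.
patterns = json.loads(config_text)
compiled = {name: re.compile(p) for name, p in patterns.items()}

# With a single backslash, JSON itself decodes \b as a backspace,
# which reproduces the '\x08test\x08' form from the question.
backspace_version = json.loads(r'"\btest\b"')
```

Writing the file with `json.dumps` avoids the issue, since it doubles the backslashes automatically when serializing.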
