I want to store a series of pre-tested regexes in a config file, and read and apply them at runtime.

However, because they're commonly packed with escape characters, by the time I've loaded them up into memory, and populated them into a dict, they've been escaped to death.

How can I preserve the integrity of my regex definitions, so that they will re.compile?

Alternatively, given that many of the post-escape strings end up containing characters such as \x08, how do I convert them back into a form that re.compile will consume correctly?

For example, I have written the regex "\btest\b" in a file. I can compile it by hand with re.compile(r"\btest\b"), but I don't want to write this code by hand. I want to read each pattern from the file and process it as a variable (I have thousands of these to deal with).

There doesn't seem to be a way to apply the r prefix to a string variable, so I'm left trying to compile '\x08test\x08', which doesn't do what I want.

This must be a fairly regular issue - how do others deal with this problem?

Thomas Kimber
  • If you write `\btest\b` as literal text in a file and then read it in, it will be equal to `r'\btest\b'` – Wiktor Stribiżew Oct 25 '18 at 14:04
  • How about just opening the file, iterating through the lines, and passing each line to re.compile? Could you post an example of what your file looks like? – Sharku Oct 25 '18 at 14:10
  • Being "escaped to death" is just a human problem though. The program would still be reading the files just fine as intended. e.g. `\btest\b` would be read as `\\btest\\b` and would `re.compile()` just fine. – r.ook Oct 25 '18 at 14:12
  • That's not what I'm finding - perhaps thats because I'm reading the file via a csv library, json, or pandas. I don't intend to write the io-level code for what should be a simple config parser. It seems as though there are numerous ways that python starts with `"\btest\b"` (which re.compile accepts) but once it ends up in the "\08test\08" form (which re.compile does not - at least, it doesn't interpret that the same way) there seems to be no simple way to perform the reverse operation. – Thomas Kimber Oct 25 '18 at 14:13
  • @Idlehands if I could force the literal to be "escaped" into `"\\btest\\b"` then yes, that would work I guess - unfortunately, that doesn't seem to be what's happening. – Thomas Kimber Oct 25 '18 at 14:16
  • @ThomasKimber It *should* be though. How are you reading the file? When I tested a file stored with `\btest\b`, `open(file, 'r').read()` returns exactly that, `\\btest\\b`. – r.ook Oct 25 '18 at 14:17
  • OK, so the complication is that the file contains rules that follow a series of different format grammars - to read the grammar, I'm using regex - and one of the allowable rule syntaxes is to accept a regex string. So I'm pulling each rule apart using regex, and one output from that might be a regex string which itself needs to be applied post `re.compile`. But, it could be I've been misdirected, since I've just been testing this operation in the interpreter, which I guess is doing more than would be happening if I lifted the content direct from a file. – Thomas Kimber Oct 25 '18 at 14:23
  • It's a bit hard to follow solely based on your comment. If the answer below didn't solve your problem, I'd recommend you provide a [MCVE] to reproduce the problem so we can understand a bit better the nature of the issue. – r.ook Oct 25 '18 at 14:34
  • This was a case of [PEBKAC](https://en.wiktionary.org/wiki/PEBCAK) I was testing functionality that would (as you guys correctly spotted) expect to take a value from a file, by typing in a similar representation via the interpreter. Since the interpreter "interprets", I managed to mislead myself into thinking this was about raw vs escaped strings.... – Thomas Kimber Oct 25 '18 at 16:43

1 Answer

Like the comment says, there is no need to do anything special.

Given a UTF-8 encoded text file named regexps.txt with one regex on each line, you can build a list of compiled regexps from that file like this:

import re
with open('regexps.txt', encoding='utf8') as f:
    # Strip the trailing newline so it does not become part of the pattern.
    compiled_regexps = [re.compile(line.rstrip('\n')) for line in f]
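To see why nothing special is needed, here is a minimal sketch that writes a one-line pattern file to a temporary directory (the file name and sample strings are made up for illustration), reads it back, and compares the result with the raw and non-raw literals. It also shows one case where escapes really are interpreted: JSON treats `\b` inside a string as a backspace, so a JSON config file would need doubled backslashes.

```python
import json
import os
import re
import tempfile

# Hypothetical pattern file: one regex per line, stored exactly as typed.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "regexps.txt")
    with open(path, "w", encoding="utf8") as f:
        f.write("\\btest\\b\n")  # the file holds the 8 characters \btest\b

    with open(path, encoding="utf8") as f:
        # Drop the line terminator and skip blank lines.
        patterns = [line.rstrip("\n") for line in f if line.strip()]

from_file = patterns[0]
assert from_file == r"\btest\b"   # backslashes survive the read untouched
assert from_file != "\btest\b"    # the non-raw literal contains backspaces

compiled = re.compile(from_file)
matches_word = compiled.search("a test here") is not None
matches_inside = compiled.search("attested") is not None

# JSON does interpret escapes: \b in a JSON string is a backspace.
json_backspace = json.loads('"\\btest\\b"')      # JSON text: "\btest\b"
json_literal = json.loads('"\\\\btest\\\\b"')    # JSON text: "\\btest\\b"
```

Escape processing happens when Python parses a string literal in source code or at the interpreter prompt, not when text is read from a plain file, which is why the pattern read from the file behaves like the `r"..."` version.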
codeape