Python regex for select/extract from nested groups

Question

I am trying to process a string with CHAR(int) and NCHAR(int) to convert those instances with their ASCII counter-parts. An example would be something like this:

CHAR(124) + (SELECT TOP 1 CAST(name AS VARCHAR(8000)) FROM (SELECT TOP 1 colid, name FROM [Projects]..[syscolumns]
WHERE xtype=char(85)
AND id = OBJECT_ID(NCHAR(69)+NCHAR(78)+NCHAR(95)+NCHAR(69)+NCHAR(109)+NCHAR(112)+NCHAR(108))

Note that I don't want to do anything to VARCHAR(int), and just to the CHAR(int) and NCHAR(int) parts only. The above should translate to:

|(SELECT TOP 1 CAST(name AS VARCHAR(8000)) FROM (SELECT TOP 1 colid, name FROM [Projects]..[syscolumns] WHERE xtype=U AND id = OBJECT_ID(EN_Empl)

Note that any "+" on either side of CHAR(int) or NCHAR(int) should be removed. I tried the the following:

def conv(m):
    return chr(int(m.group(2)))

print re.sub(r'([\+ ]?n?char\((.*?)\)[\+ ]?)', conv, str, re.IGNORECASE)

where str=the raw string that must be processed.

Somehow, the VARCHAR(8000) is being picked up. If I tweak the regex, the "=" after xtype is going away, rather than just the space and the "+" on either side of a CHAR(int) or NCHAR(int) instance.

Hope someone can pull me out of this.

ADDITIONAL SAMPLE STRINGS:

String "char(124)+(Select Top 1 cast(name as varchar(8000)) from (Select Top 1 colid,name From [Projects]..[syscolumns] Where id = OBJECT_ID(NCHAR(69)+NCHAR(78)+NCHAR(95)+NCHAR(69)+NCHAR(109)+NCHAR(112)+NCHAR(108)))"

Regex: r'(\bn?char\((\d+)\)(?:\s*\+\s*)?)'

Result: "|(Select Top 1 cast(name as varchar(8000)) from (Select Top 1 colid,name From [Projects]..[syscolumns] Where id = OBJECT_ID(ENCHAR(78)+NCHAR(95)+NCHAR(69)+NCHAR(109)+NCHAR(112)+NCHAR(108)))"

dawg · Answer 1 · 2013-12-14T19:00:07.230

You have three issues:

You need to use flags=re.IGNORECASE and not just re.IGNORECASE in re.sub. That is a keyword argument.
You need to use \b to find the word boundary.
You should not use str as a name since you will overwrite the built-in by the same name

This works:

import re

tgt='''\
CHAR(124) + (SELECT TOP 1 CAST(name AS VARCHAR(8000)) FROM (SELECT TOP 1 colid, name FROM [Projects]..[syscolumns]
WHERE xtype=char(85)
AND id = OBJECT_ID(NCHAR(69)+NCHAR(78)+NCHAR(95)+NCHAR(69)+NCHAR(109)+NCHAR(112)+NCHAR(108))'''

pat=r'(\bn?char\((\d+)\)(?:\s*\+\s*)?)'

def conv(m):
    return chr(int(m.group(2)))

print re.sub(pat, conv, tgt, flags=re.IGNORECASE)

More completely:

import re

tgt='''\
CHAR(124) + (SELECT TOP 1 CAST(name AS VARCHAR(8000)) FROM (SELECT TOP 1 colid, name FROM [Projects]..[syscolumns]
WHERE xtype=char(85)
AND id = OBJECT_ID(NCHAR(69)+NCHAR(78)+NCHAR(95)+NCHAR(69)+NCHAR(109)+NCHAR(112)+NCHAR(108))'''

pat=r'(\bn?char\((\d+)\)(?:\s*\+\s*)?)'

def conv(m):
    return chr(int(m.group(2)))

print re.sub(r'''
              (                                 # group 1
              \b                                # word boundary
              n?char                            # nchar or char
              \(                                # literal left paren
              (\s*\d+\s*)                       # digits surrounded by spaces
              \)                                # literal right paren
              (?:\s*\+\s*)?                     # optionally followed by a concating '+' 
              )                                 '''
            , conv, tgt, flags=re.VERBOSE | re.IGNORECASE)

Prints:

|(SELECT TOP 1 CAST(name AS VARCHAR(8000)) FROM (SELECT TOP 1 colid, name FROM [Projects]..[syscolumns]
WHERE xtype=U
AND id = OBJECT_ID(EN_Empl)

I used your expression and I think I am a step closer! However, if you look at the new sample I have provided in my question alonside the regex you provided, you will observe that a series of nchar(int) instances separated by "+" seems to convert only the first instance and the remaining instances are preserved as is. Any suggestions? NCHAR(69) was replaced with "E" but everything else stayed the same. — Web User, Dec 14 '13 at 18:57
Yes I used the IGNORECASE flag. @Tim Peters' answer worked for me. I think the difference between your answer and his is that he added the "+" and whitespace portion before and after, which was needed. Thanks for your help!! — Web User, Dec 14 '13 at 19:11

score 1 · Accepted Answer · answered Dec 14 '13 at 18:39

1

You can go a long way just by adding the word boundary (\b) assertion, but I'd like to suggest that you (1) use re.VERBOSE to write a regexp someone can understand later; (2) compile the regexp to reduce clutter at the call site; and, (3) tighten some of the matching criteria. Like so:

def conv(m):
    return chr(int(m.group(1)))

pat = re.compile(r"""[+\s]*    # optional whitespace or +
                     \b        # word boundary
                     n?char    # NCHAR or CHAR
                     \(        # left paren
                     ([\d\s]+) # digits or spaces - group 1
                     \)        # right paren
                     [+\s]*    # optional whitespace or +
                  """, re.VERBOSE | re.IGNORECASE)
print pat.sub(conv, data)

Note that I changed your str to data: str is the name of a heavily used builtin function, and it's a Really Bad Idea to create a variable with the same name.

answered Dec 14 '13 at 18:39

Tim Peters

67,464
13
126
132

Thanks @Tim Peters. Appreciate the suggestion to improve readability; it helps me too (let alone someone else!). I did try it out and it appears to work! One thing I don't understand about the grouping+search+replace works in your regex. I was originally creating one group (for the value to be converted to its ASCII equivalent) within another group (that encapsulated the "+" and whitespace around the [N]CHAR(int) instances). Your regex removes any surrounding "+" or whitespace even though it is not part of the group. I've to spend more time with regex fundamentals. Thanks for your help! – Web User Dec 14 '13 at 19:07
1

You're welcome :-) `sub()` replaces the entire substring matched by the regexp, so there was really no need for the outermost group. That's why I removed it. We do still need a group to isolate the digits, though, so that `conv()` can find them easily. But the output of `conv()` replaces the *entire* substring matched by the regexp. Maybe a little subtle at first, but you'll get used to it quickly ;-) – Tim Peters Dec 14 '13 at 19:11
Thanks @Tim Peters for the very helpful explanation and answer! – Web User Dec 14 '13 at 19:12
I have another instance of these strings which contains something like `_char(75)` or `_nchar(65)` the conversion is skipping. This I understand is because "_" is considered as part of the criteria for a word boundary, i.e. `[a-zA-Z0-9_]`. How do I handle this using the above pattern that works for the majority of cases? – Web User Dec 14 '13 at 23:05
Add `_?` to the regexp after the `\b` line. And notice how easy it is to change a regexp when it's written in little pieces spread across multiple lines ;-) – Tim Peters Dec 14 '13 at 23:16
It is significantly better writing the regex in a multi-line, modular format! I added the `_?` on a separate line after the `\b` line. But it does not work. Here is a sample string: `"nchar(75)+char(66)3154and_char(124)+db_name()=0"`. After conversion, here is the resulting string: `'KB3154and_char(124)+db_name()=0'`. Notice that `char(124)` is not converted. – Web User Dec 14 '13 at 23:45
It doesn't match because `and_char` is "one word". Nobody can solve that for you: you have to decide exactly what you do, and don't, want to match. It's even possible there is no mechanical way to do the right thing in all cases. Do you want to match, e.g., `frizzle_char(61)`? I can't guess. There's no obvious lexical difference between `frizzle_char` and `and_char` - any "kinda general" regexp will match both of those or neither of those. – Tim Peters Dec 15 '13 at 00:38
let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/43196/discussion-between-web-user-and-tim-peters) – Web User Dec 15 '13 at 00:52
Sorry, can't do chat - I peek in here when I find a minute to spare, which isn't often ;-) Suggest you get as clear as possible about what you *want*, then open a new question. – Tim Peters Dec 15 '13 at 00:59
Understand. The option to chat popped up and I did not want to stretch this thread since any additional exchange befitted an offline discussion. I understand what you are saying. Thanks! – Web User Dec 15 '13 at 01:00
The goal is to find all instances of char(int) and nchar(int) and convert them to their ASCII equivalent. The edge case is that if I find a "+" or a space on either side of the main pattern, i.e. `n?char(\d+)` then I must eliminate them. Using a word boundary simplifies a generalized form of this problem, but in this case, I think I may have to substitute. This appears to work although I am still in the process of testing it: `r'[+\s]*([_]?)n?char((\d+))[+\s]*'`. Problem with `\b` (word boundary) in this particular use case is that "_" cannot lie within the boundary. – Web User Dec 15 '13 at 01:03

score 0 · Answer 3 · answered Dec 14 '13 at 18:26

0

You only need to use a word boundary \b:

def conv(m):
    return chr(int(m.group(1)))

print re.sub(r'\bn?char\(([^)]+)\)(?:\s*\+\s*)?', conv, str, re.IGNORECASE)

answered Dec 14 '13 at 18:26

Casimir et Hippolyte

88,009
5
94
125

Python regex for select/extract from nested groups

3 Answers3