String replacement: Multiline, Case-Insensitive with Special Characters

Question

Objective: Given a string, replace every occurrence of '<?xml version="1.0" encoding="utf-8"?>' and uppercase cousins with the empty string ''.

A string.replace() solution and/or a re.sub() solution would be great. A solution based on the BeautifulSoup module would be considered only as last resort.

Attempt based on string.replace():

s = '1:<?xml version="1.0" encoding="utf-8"?>\n2:<?xml version="1.0" encoding="UTF-8"?>'
## 1:<?xml version="1.0" encoding="utf-8"?>
## 2:<?xml version="1.0" encoding="UTF-8"?>
h = '<?xml version="1.0" encoding="utf-8"?>'
r = s.replace(h, '')
## 1:
## 2:<?xml version="1.0" encoding="UTF-8"?>

Problem: does not remove occurrences with upper case formatting, as in UTF-8.

Attempt based on re.sub():

import re
s = '1:<?xml version="1.0" encoding="utf-8"?>\n2:<?xml version="1.0" encoding="UTF-8"?>'
## 1:<?xml version="1.0" encoding="utf-8"?>
## 2:<?xml version="1.0" encoding="UTF-8"?>
h = '<?xml version="1.0" encoding="utf-8"?>'
r = re.sub(h, '', s, flags=re.IGNORECASE | re.MULTILINE)
## 1:<?xml version="1.0" encoding="utf-8"?>
## 2:<?xml version="1.0" encoding="UTF-8"?>

Problem: does not work at all. And yet, a simpler case works:

    import re
    s = '1:a\n2:A'
    ## 1:a
    ## 2:A
    h = 'a'
    r = re.sub(h, '', s, flags=re.IGNORECASE | re.MULTILINE)
    ## 1:
    ## 2:

I suspect the problem comes from the special characters inside the string, e.g. <?xml, but haven't been able to find a solution.

The <?xml header is introduced into my code by the xml parser via the BeautifulSoup module. I haven't had much success with BeautifulSoup's methods here, e.g. .find_all() and .replace_with(). I tried soup.decode_contents(), which worked for some cases but not others. I'm not posting examples of what I tried, because I'd rather not use the module for the particular task at hand (I have a string, I want to output a string, and do not want BeautifulSoup to otherwise alter the string). With apologies to the BS die-hards. ;-)

References (that did not solve my problem): https://stackoverflow.com/questions/33207503/how-do-i-remove-an-xml-declaration-using-beautifulsoup4, https://stackoverflow.com/questions/36503875/how-to-remove-xml-header-in-beautifulsoup — PatrickT, May 07 '20 at 22:11

score 1 · Accepted Answer · answered May 07 '20 at 23:09

1

Yes, the ? and . are regex special characters. You can escape them with, for example re.escape():

import re
s = '1:<?xml version="1.0" encoding="utf-8"?>\n2:<?xml version="1.0" encoding="UTF-8"?>'
h = re.escape('<?xml version="1.0" encoding="utf-8"?>') # <-- put re.escape() around the string
r = re.sub(h, '', s, flags=re.IGNORECASE)               # <-- no need for RE.MULTILINE

print(r)

Prints (the <?xml..?> string is replaced):

1:
2:

answered May 07 '20 at 23:09

Andrej Kesely

168,389
15
48
91

Are there situations where `re.MULTILINE` would be needed? No idea what the docs are saying: "When this flag is specified, ^ matches at the beginning of the string and at the beginning of each line within the string, immediately following each newline. Similarly, the $ metacharacter matches either at the end of the string and at the end of each line (immediately preceding each newline)." (https://docs.python.org/3/howto/regex.html) – PatrickT May 07 '20 at 23:40
1

@PatrickT Yes, if you want match beginning/end of lines, you need to use `re.M`. For example: [without `re.M`](https://regex101.com/r/9YD3Zx/1) and [with `re.M`](https://regex101.com/r/9YD3Zx/2) - the example with `re.M` matches all lines, without `re.M` matches only last line. – Andrej Kesely May 07 '20 at 23:46
Thanks a lot Andrej! In 6 minutes I'll be able to upvote your answer. :-) – PatrickT May 07 '20 at 23:54

String replacement: Multiline, Case-Insensitive with Special Characters

1 Answers1