Objective: Given a string, replace every occurrence of '<?xml version="1.0" encoding="utf-8"?>' and uppercase cousins with the empty string ''.
A string.replace() solution and/or a re.sub() solution would be great. A solution based on the BeautifulSoup module would be considered only as a last resort.
Attempt based on string.replace():
s = '1:<?xml version="1.0" encoding="utf-8"?>\n2:<?xml version="1.0" encoding="UTF-8"?>'
## 1:<?xml version="1.0" encoding="utf-8"?>
## 2:<?xml version="1.0" encoding="UTF-8"?>
h = '<?xml version="1.0" encoding="utf-8"?>'
r = s.replace(h, '')
## 1:
## 2:<?xml version="1.0" encoding="UTF-8"?>
Problem: this does not remove occurrences that contain uppercase letters, as in UTF-8.
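For illustration, here is a minimal sketch of how string.replace() alone could cover this, assuming the full set of spellings is known in advance. The list `variants` is a hypothetical name introduced here, and it only covers the two spellings shown above.

```python
# Assumption: only these two spellings of the header occur in the input.
variants = [
    '<?xml version="1.0" encoding="utf-8"?>',
    '<?xml version="1.0" encoding="UTF-8"?>',
]

s = '1:<?xml version="1.0" encoding="utf-8"?>\n2:<?xml version="1.0" encoding="UTF-8"?>'

r = s
for v in variants:
    # str.replace() is case-sensitive, so each spelling needs its own pass
    r = r.replace(v, '')

print(r)
## 1:
## 2:
```

This does not scale to arbitrary mixed-case spellings (e.g. Utf-8), which is why a case-insensitive approach is still what I am after.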
Attempt based on re.sub():
import re
s = '1:<?xml version="1.0" encoding="utf-8"?>\n2:<?xml version="1.0" encoding="UTF-8"?>'
## 1:<?xml version="1.0" encoding="utf-8"?>
## 2:<?xml version="1.0" encoding="UTF-8"?>
h = '<?xml version="1.0" encoding="utf-8"?>'
r = re.sub(h, '', s, flags=re.IGNORECASE | re.MULTILINE)
## 1:<?xml version="1.0" encoding="utf-8"?>
## 2:<?xml version="1.0" encoding="UTF-8"?>
Problem: does not work at all. And yet, a simpler case works:
import re
s = '1:a\n2:A'
## 1:a
## 2:A
h = 'a'
r = re.sub(h, '', s, flags=re.IGNORECASE | re.MULTILINE)
## 1:
## 2:
I suspect the problem comes from the special characters inside the string, e.g. <?xml, but haven't been able to find a solution.
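One way to test that suspicion is to compare the raw pattern with a version passed through re.escape(), which backslash-escapes regex metacharacters so the pattern is matched literally. This is only a sketch under the assumption that the header text is otherwise identical apart from letter case; the names `unescaped` and `escaped` are introduced here for illustration.

```python
import re

s = '1:<?xml version="1.0" encoding="utf-8"?>\n2:<?xml version="1.0" encoding="UTF-8"?>'
h = '<?xml version="1.0" encoding="utf-8"?>'

# Unescaped, '?' is a quantifier: '<?' means "an optional '<'" and '"?' means
# "an optional double quote", so the literal '?' characters in s never match.
unescaped = re.sub(h, '', s, flags=re.IGNORECASE)

# re.escape() escapes the metacharacters, so h is treated as literal text.
escaped = re.sub(re.escape(h), '', s, flags=re.IGNORECASE)

print(unescaped == s)
## True
print(escaped)
## 1:
## 2:
```

If this holds, re.MULTILINE is not needed here, since it only changes how ^ and $ behave and the pattern contains neither.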
The <?xml header is introduced into my code by the xml parser via the BeautifulSoup module. I haven't had much success with BeautifulSoup's methods here, e.g. .find_all() and .replace_with(). I tried soup.decode_contents(), which worked for some cases but not others. I'm not posting examples of what I tried, because I'd rather not use the module for the particular task at hand (I have a string, I want to output a string, and do not want BeautifulSoup to otherwise alter the string). With apologies to the BS die-hards. ;-)