
Objective: Given a string, replace every occurrence of '<?xml version="1.0" encoding="utf-8"?>' and uppercase cousins with the empty string ''.

A string.replace() solution and/or a re.sub() solution would be great. A solution based on the BeautifulSoup module would be considered only as a last resort.

  1. Attempt based on string.replace():

    s = '1:<?xml version="1.0" encoding="utf-8"?>\n2:<?xml version="1.0" encoding="UTF-8"?>'
    ## 1:<?xml version="1.0" encoding="utf-8"?>
    ## 2:<?xml version="1.0" encoding="UTF-8"?>
    h = '<?xml version="1.0" encoding="utf-8"?>'
    r = s.replace(h, '')
    ## 1:
    ## 2:<?xml version="1.0" encoding="UTF-8"?>
    

Problem: does not remove occurrences with upper case formatting, as in UTF-8.

  2. Attempt based on re.sub():

    import re
    s = '1:<?xml version="1.0" encoding="utf-8"?>\n2:<?xml version="1.0" encoding="UTF-8"?>'
    ## 1:<?xml version="1.0" encoding="utf-8"?>
    ## 2:<?xml version="1.0" encoding="UTF-8"?>
    h = '<?xml version="1.0" encoding="utf-8"?>'
    r = re.sub(h, '', s, flags=re.IGNORECASE | re.MULTILINE)
    ## 1:<?xml version="1.0" encoding="utf-8"?>
    ## 2:<?xml version="1.0" encoding="UTF-8"?>
    

Problem: does not work at all. And yet, a simpler case works:

    import re
    s = '1:a\n2:A'
    ## 1:a
    ## 2:A
    h = 'a'
    r = re.sub(h, '', s, flags=re.IGNORECASE | re.MULTILINE)
    ## 1:
    ## 2:

I suspect the problem comes from the special characters inside the string, e.g. <?xml, but haven't been able to find a solution.

The <?xml header is introduced into my code by the xml parser via the BeautifulSoup module. I haven't had much success with BeautifulSoup's methods here, e.g. .find_all() and .replace_with(). I tried soup.decode_contents(), which worked for some cases but not others. I'm not posting examples of what I tried, because I'd rather not use the module for the particular task at hand (I have a string, I want to output a string, and do not want BeautifulSoup to otherwise alter the string). With apologies to the BS die-hards. ;-)

PatrickT
  • References (that did not solve my problem): https://stackoverflow.com/questions/33207503/how-do-i-remove-an-xml-declaration-using-beautifulsoup4, https://stackoverflow.com/questions/36503875/how-to-remove-xml-header-in-beautifulsoup – PatrickT May 07 '20 at 22:11

1 Answer


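Yes, ? and . are regex special characters: ? makes the preceding character optional and . matches any character, so the unescaped pattern never matches the literal ? characters in your string. You can escape them with, for example, re.escape():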

    import re
    s = '1:<?xml version="1.0" encoding="utf-8"?>\n2:<?xml version="1.0" encoding="UTF-8"?>'
    h = re.escape('<?xml version="1.0" encoding="utf-8"?>') # <-- put re.escape() around the string
    r = re.sub(h, '', s, flags=re.IGNORECASE)               # <-- no need for re.MULTILINE

    print(r)

Prints (the <?xml..?> string is replaced):

    1:
    2:
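
To make the failure mode concrete, here is a minimal sketch that contrasts the unescaped pattern with the escaped one. The variable names (decl, unescaped_hit, pattern, cleaned) are only illustrative and are not from the question:

```python
import re

decl = '<?xml version="1.0" encoding="utf-8"?>'
s = '1:' + decl + '\n2:' + decl.replace('utf-8', 'UTF-8')

# Unescaped: '<?' means "an optional '<'" and '"?' means "an optional '"'",
# so the literal '?' characters in s are never matched and the search fails.
unescaped_hit = re.search(decl, s, flags=re.IGNORECASE)

# Escaped: every metacharacter in decl is treated as a literal character.
pattern = re.compile(re.escape(decl), flags=re.IGNORECASE)
cleaned = pattern.sub('', s)

print(unescaped_hit)   # None
print(repr(cleaned))   # '1:\n2:'
```

Compiling the escaped pattern once is just a convenience if the same declaration has to be stripped from many strings; a plain re.sub() call as above works equally well.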
Andrej Kesely
  • Are there situations where `re.MULTILINE` would be needed? No idea what the docs are saying: "When this flag is specified, ^ matches at the beginning of the string and at the beginning of each line within the string, immediately following each newline. Similarly, the $ metacharacter matches either at the end of the string and at the end of each line (immediately preceding each newline)." (https://docs.python.org/3/howto/regex.html) – PatrickT May 07 '20 at 23:40
  • @PatrickT Yes, if you want to match the beginning/end of lines, you need to use `re.M`. For example: [without `re.M`](https://regex101.com/r/9YD3Zx/1) and [with `re.M`](https://regex101.com/r/9YD3Zx/2). The example with `re.M` matches all lines; without `re.M` it matches only the last line. – Andrej Kesely May 07 '20 at 23:46
  • Thanks a lot Andrej! In 6 minutes I'll be able to upvote your answer. :-) – PatrickT May 07 '20 at 23:54
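
As a small illustration of the re.MULTILINE point discussed in the comments, the sketch below reuses the simple string from the question. The pattern ^\d is an assumption chosen only to show how ^ behaves; it is not taken from the linked regex101 examples:

```python
import re

s = '1:a\n2:A'

# Without re.MULTILINE, '^' anchors only at the very start of the string.
without_m = re.findall(r'^\d', s)

# With re.MULTILINE, '^' also anchors immediately after each newline.
with_m = re.findall(r'^\d', s, flags=re.MULTILINE)

print(without_m)  # ['1']
print(with_m)     # ['1', '2']
```

Since the pattern for the XML declaration contains no ^ or $, the flag has no effect there, which is why it can be dropped.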