
I'm working with an XML document in which I edit the text and attributes of tags, but I ran into a problem: ElementTree namespace registration does not work as I expect. The process is: I parse the XML document, extract its namespaces, register them in order to preserve them as they are in the input, make changes to some tags, and then write the final document. The problem is that not all of the namespaces are preserved, and some are changed in the saved file. When I debug the script, however, it appears (as far as I can tell) that the namespaces are kept.

Here is a simple code snippet reading my XML document, trying to preserve namespaces and then saving the document.

import xml.etree.ElementTree as ET

def convert_ADI(adiPath):
    tree = ET.parse(adiPath)
    root = tree.getroot()
    namespaces = dict([elem for _, elem in ET.iterparse(adiPath, events=['start-ns'])])
    for ns in namespaces:
        ET.register_namespace(ns, namespaces[ns])
    tree.write("ADI_edited_test.xml", encoding = "utf-8", xml_declaration = True)

convert_ADI(r'C:\Users\user\python\pADI\Document.XML')

Here is original XML document:

<?xml version="1.0" encoding="utf-8"?>
<ADI3 xmlns="urn:cablelabs:md:xsd:core:3.0"
      xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
      xmlns:content="urn:cablelabs:md:xsd:content:3.0"
      xmlns:core="urn:cablelabs:md:xsd:core:3.0"
      xmlns:offer="urn:cablelabs:md:xsd:offer:3.0"
      xmlns:terms="urn:cablelabs:md:xsd:terms:3.0"
      xmlns:title="urn:cablelabs:md:xsd:title:3.0"
      xmlns:adb="urn:adb:md:xsd:adb:01"
      xmlns:schemaLocation="urn:adb:md:xsd:adb:01 ADB-EXT-C01.xsd urn:cablelabs:md:xsd:core:3.0 MD-SP-CORE-C01.xsd urn:cablelabs:md:xsd:content:3.0 MD-SP-CONTENT-C01.xsd urn:cablelabs:md:xsd:offer:3.0 MD-SP-OFFER-C01.xsd urn:cablelabs:md:xsd:terms:3.0 MD-SP-TERMS-C01.xsd urn:cablelabs:md:xsd:title:3.0 MD-SP-TITLE-C01.xsd"
      xmlns:xml="http://www.w3.org/XML/1998/namespace">
  <Asset xsi:type="title:TitleType" uriId="ID" providerVersionNum="5"
     internalVersionNum="0" creationDateTime="2020-04-22T00:00:00Z"
     startDateTime="2020-03-24T09:00:00Z" endDateTime="2022-10-06T23:59:00Z">
    <AlternateId identifierSystem="VOD1.1">ID</AlternateId>
    <ProviderQAContact>Contact</ProviderQAContact>
    <Ext>
      <adb:ExtensionType>
        <adb:TitleExt>
          <adb:SeriesInfo episodeNumber="16">
            <adb:series seriesId="106585" seasonCount="2"/>
            <adb:season seasonId="106586" number="1" episodeCount="22"/>
          </adb:SeriesInfo>
        </adb:TitleExt>
      </adb:ExtensionType>
    </Ext>
    <title:LocalizableTitle xml:lang="pol">
      <title:TitleLong>BATWOMAN EP. 16 - THROUGH THE LOOKING GLASS</title:TitleLong>
      <title:SummaryLong> Very long summary...</title:SummaryLong>
      <title:Actor fullName="Ruby Rose" firstName="Ruby" lastName="Rose"/>
      <title:Actor fullName="Rachel Skarsten" firstName="Rachel" lastName="Skarsten"/>
      <title:Actor fullName="Meagan Tandy" firstName="Meagan" lastName="Tandy"/>
      <title:Actor fullName="Camrus Johnson" firstName="Camrus" lastName="Johnson"/>
      <title:Director fullName="Sudz Sutherland" firstName="Sudz" lastName="Sutherland"/>
    </title:LocalizableTitle>
    <title:Rating ratingSystem="PL">12</title:Rating>
    <title:DisplayRunTime>00:40</title:DisplayRunTime>
    <title:Year>2019</title:Year>
    <title:CountryOfOrigin>US</title:CountryOfOrigin>
    <title:Genre>Genre</title:Genre>
    <title:ShowType>Movie</title:ShowType>
  </Asset>
  <Asset xsi:type="offer:CategoryType" uriId="ID">
    <AlternateId identifierSystem="VOD1.1">ID</AlternateId>
    <offer:CategoryPath>Path</offer:CategoryPath>
  </Asset>
  <Asset xsi:type="content:MovieType" uriId="namemp4">
    <AlternateId identifierSystem="VOD1.1">namemp4</AlternateId>
    <content:SourceUrl>name.mp4</content:SourceUrl>
    <content:Resolution>resolution</content:Resolution>
    <content:Duration>PT0H40M40S</content:Duration>
    <content:Language>pol</content:Language>
    <content:SubtitleLanguage>pol</content:SubtitleLanguage>
    <content:SubtitleLanguage>eng</content:SubtitleLanguage>
  </Asset>
  <Asset uriId="ID" xsi:type="content:MovieType">
    <AlternateId identifierSystem="VOD1.1">ID</AlternateId>
    <Provider>Prov</Provider>
    <content:SourceUrl>sub.srt</content:SourceUrl>
  </Asset>
  <Asset uriId="ID" xsi:type="content:MovieType">
    <AlternateId identifierSystem="VOD1.1">ID</AlternateId>
    <Provider>Prov</Provider>
    <content:SourceUrl>sub.srt</content:SourceUrl>
  </Asset>
  <Asset xsi:type="content:PosterType" uriId="ID">
    <AlternateId identifierSystem="VOD1.1">ID</AlternateId>
    <content:SourceUrl>poster.jpg</content:SourceUrl>
    <content:X_Resolution>700</content:X_Resolution>
    <content:Y_Resolution>1000</content:Y_Resolution>
    <content:Language>pol</content:Language>
  </Asset>
  <Asset xsi:type="offer:ContentGroupType" uriId="ID">
    <AlternateId identifierSystem="VOD1.1">ID</AlternateId>
    <offer:TitleRef uriId="ID"/>
    <offer:MovieRef uriId="namets"/>
    <offer:MovieRef uriId="subs"/>
    <offer:MovieRef uriId="subs"/>
  </Asset>
  <Asset xsi:type="offer:ContentGroupType" uriId="ID">
    <AlternateId identifierSystem="VOD1.1">ID</AlternateId>
    <offer:TitleRef uriId="ID"/>
    <offer:MovieRef uriId="poster"/> 
  </Asset>
</ADI3>

The result of just reading, and trying to write document without any changes is that the namespaces miss many fields:

<core:ADI3 xmlns:adb="urn:adb:md:xsd:adb:01" xmlns:content="urn:cablelabs:md:xsd:content:3.0" xmlns:core="urn:cablelabs:md:xsd:core:3.0" xmlns:offer="urn:cablelabs:md:xsd:offer:3.0" xmlns:title="urn:cablelabs:md:xsd:title:3.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">

Additionally, the `core` prefix is added to the ADI3 and Asset tags. I would like to keep the document the same as it was on input. Thank you in advance for any tip that brings me closer to the solution.
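
A possible explanation for the `core:` prefix, offered as a sketch rather than a confirmed diagnosis: `ET.register_namespace` keeps only one prefix per namespace URI, so when the same URI is declared both as the default namespace and under the `core` prefix, whichever is registered last wins. The example below assumes that preferring the default (empty-prefix) binding is what is wanted. `collect_namespaces`, `roundtrip` and the shortened `SAMPLE` document are hypothetical names introduced for illustration.

```python
import io
import xml.etree.ElementTree as ET

# Shortened, hypothetical stand-in for the ADI document above.
SAMPLE = b"""<?xml version="1.0" encoding="utf-8"?>
<ADI3 xmlns="urn:cablelabs:md:xsd:core:3.0"
      xmlns:core="urn:cablelabs:md:xsd:core:3.0"
      xmlns:title="urn:cablelabs:md:xsd:title:3.0">
  <Asset uriId="ID">
    <title:Year>2019</title:Year>
  </Asset>
</ADI3>"""


def collect_namespaces(xml_bytes):
    """Map each namespace URI to one prefix, preferring the default ('') binding."""
    uri_to_prefix = {}
    for _, (prefix, uri) in ET.iterparse(io.BytesIO(xml_bytes), events=["start-ns"]):
        if uri_to_prefix.get(uri) == "":
            continue  # URI is already bound as the default namespace; keep that
        uri_to_prefix[uri] = prefix
    return uri_to_prefix


def roundtrip(xml_bytes):
    """Parse and re-serialize, registering one prefix per namespace URI first."""
    for uri, prefix in collect_namespaces(xml_bytes).items():
        # Note: register_namespace changes module-wide state.
        ET.register_namespace(prefix, uri)
    root = ET.fromstring(xml_bytes)
    return ET.tostring(root, encoding="unicode")


result = roundtrip(SAMPLE)
print(result)
```

With the default binding registered, elements in that namespace should be written without a prefix. This does not bring back declarations for namespaces that no element or attribute uses; ElementTree still omits those on write.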

Edit: Iterating through the file using ET was a bit simpler for me. To remove the Ext tag, I had to find the Asset tag and then find Ext within that Asset. E.g.

namespaces = dict([elem for _, elem in ET.iterparse(adiPath, events=['start-ns'])])
for ns in namespaces:
    ET.register_namespace(ns, namespaces[ns])
for asset in root.findall('.//{*}Asset'):
    # If the Asset's "type" attribute is "title:TitleType": change the AlternateId path
    # to the correct one from config, remove the 'Ext' tag and adjust the rating
    if 'title:TitleType' in asset.attrib.values():
        # Find and remove the 'Ext' tag
        ext = asset.find('.//{*}Ext')
        if ext is not None:
            asset.remove(ext)

This is very logical to me. Is there a similar way in lxml to traverse the tags and, when a certain tag is found, find a child element of that tag?

Currently to get into Ext using lxml I use:

nsmap = {}
for ns in root.xpath('//namespace::*'):
    if ns[0]:
        nsmap[ns[0]] = ns[1]
ext = root.xpath('.//core:Ext', namespaces=nsmap)

And I'm not certain how to remove such an element.
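
One hedged way to remove such an element without knowing its parent up front is to walk every element and remove matching direct children. The sketch below uses the standard library's ElementTree (Python 3.8+ for the `{*}` wildcard) so that it is self-contained. `lxml.etree` elements offer `iter()`, `findall()` and `remove()` as well, though whether the `{*}` wildcard behaves identically in a given lxml version is an assumption to verify. `remove_children` and the shortened `SAMPLE` are hypothetical names.

```python
import xml.etree.ElementTree as ET

# Shortened, hypothetical stand-in for the ADI document above.
SAMPLE = """<ADI3 xmlns="urn:cablelabs:md:xsd:core:3.0"
      xmlns:adb="urn:adb:md:xsd:adb:01">
  <Asset uriId="ID">
    <AlternateId identifierSystem="VOD1.1">ID</AlternateId>
    <Ext>
      <adb:ExtensionType/>
    </Ext>
  </Asset>
</ADI3>"""


def remove_children(root, path):
    """Remove every direct child matching `path` from every element under `root`."""
    removed = 0
    # Snapshot the elements first so the tree is not modified while it is being walked.
    for parent in list(root.iter()):
        for child in parent.findall(path):
            parent.remove(child)
            removed += 1
    return removed


root = ET.fromstring(SAMPLE)
removed_count = remove_children(root, "{*}Ext")  # match Ext in any namespace
print(removed_count)
```

Because the parent is the element being iterated, no `getparent()` call is needed, which is the part ElementTree lacks.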

  • ElementTree removes declarations for namespaces that are not actually used in the XML document. See https://stackoverflow.com/q/45990761/407651. – mzjn Apr 28 '20 at 09:51
  • And `xmlns:schemaLocation` does not look right. It should be `xsi:schemaLocation`. – mzjn Apr 28 '20 at 09:57
  • Thank you for your help! I changed `xmlns:schemaLocation` to `xsi:schemaLocation` as you suggested, but I didn't notice any difference. However, this suggestion led me to the conclusion that the input XML may be wrong somehow. For now, after deleting the `core` namespace once the namespaces have been stripped, there are no `core` prefixes before the `Asset` and `ADI` tags when the file is saved. Thank you again for your help! – warezsoftwarez Apr 28 '20 at 10:42
  • The `urn:cablelabs:md:xsd:core:3.0` namespace is used in two declarations: once to declare it as the default namespace and once associating it with the `core` prefix. That seems strange. – mzjn Apr 28 '20 at 10:49
  • So, according to the linked post, it is not possible to preserve namespaces that are not used in the document with `ET`, and I have to use `lxml`? – warezsoftwarez Apr 28 '20 at 12:15
  • You will have fewer problems with lxml. – mzjn Apr 28 '20 at 12:48
  • Is there an option to use wildcards as in the `ET` case? I found this snippet to strip namespaces: `nsmap = {} for ns in root.xpath('//namespace::*'): if ns[0]: nsmap[ns[0]] = ns[1]` However, to get to an element and delete it I have to use the following: `for bad in root.xpath('.//core:Ext', namespaces=nsmap): bad.getparent().remove(bad)`. Is there a simpler way to traverse the document and remove the `Ext` tag? – warezsoftwarez Apr 28 '20 at 12:59
  • I added an edit to the main post. It should be clearer now. – warezsoftwarez Apr 28 '20 at 13:28
  • Please post a new question. It gets very confusing when you change the original question like this. – mzjn Apr 28 '20 at 13:36
  • *"The result of just reading, and trying to write document without any changes is that the namespaces miss many fields"* - No, certainly not. Every in-use namespace and every node of the input document will be in the output document after "just reading and writing". – Tomalak Apr 28 '20 at 13:37
  • mzjn and Tomalak, I just posted a new question, as mzjn suggested. Would you have any advice? https://stackoverflow.com/questions/61481841/how-to-traverse-through-xml-document-tags-using-lxml-similarly-to-elementtree – warezsoftwarez Apr 28 '20 at 13:51
  • I have tested your code against your input document, and as I said, the input document and the output document are 100% equivalent. They just look a little different on the surface, that's nothing that you should be concerned about. – Tomalak Apr 28 '20 at 13:52
  • It's strange because, as mentioned in the question, it does not preserve unused namespaces, which I ultimately need to preserve. – warezsoftwarez Apr 28 '20 at 13:57
  • No, you don't, because unused namespaces have no meaning. There is no value in keeping them, so they are not kept. Add nodes that use those namespaces to the document when you want to keep them. – Tomalak Apr 28 '20 at 14:14
