The XML file contains bibliographic data in the MARC21-xml format (used by libraries all over the place). Its structure basically looks like this:

<?xml version="1.0" encoding="UTF-8"?><collection xmlns="http://www.loc.gov/MARC21/slim">
<record type="Bibliographic">
    <leader> ... </leader>
    <controlfield> ... </controlfield>
    ...
    <controlfield> ... </controlfield>
    <datafield tag="123" ... >
        <subfield code="x"> ... </subfield>
        ...
        <subfield code="x"> ... </subfield>
    </datafield>
    <datafield tag="456" ...> 

There's a proper example file here: https://www.loc.gov/standards/marcxml/Sandburg/sandburg.xml - however, it only represents one item (e.g. a specific book). Usually these files contain hundreds to thousands of records, so the record tag with all its content is repeatable.

The file I work with has over 10,000 record tags in it (all representing different items), all of which have a datafield with the tag "082" and then several subfields. I am trying to extract the text of the subfield with code="a". However, since this field is also repeatable and some records have two of those, I only ever want the first one. My current code, which extracts the text for ALL subfields with code="a" in these datafields, looks like this:

for child in record.findall("{http://www.loc.gov/MARC21/slim}datafield[@tag='082']"):
    for subelement in child:
        if subelement.attrib['code'] == "a":
            ddc = subelement.text
            ddccoll.append(ddc)

This works but, as I said, returns too many elements: if I run it and then print the length of my list, I get 10277, yet there are only 10123 records in this file. So there are a few too many, probably because the field is repeatable.

I tried using find instead of findall, but then I get a TypeError:

TypeError                                 Traceback (most recent call last)
<ipython-input-23-fae786776bcf> in <module>
     18         idcoll.append("nicht vorhanden")
     19 
---> 20     for child in record.find("{http://www.loc.gov/MARC21/slim}datafield[@tag='082']"):
     21         for subelement in child:
     22             if subelement.attrib['code'] == "a":

TypeError: 'NoneType' object is not iterable

I am not exactly sure why, since the field 082 should be present in every single record - but since I am actually really after the subfield, this is probably not the right approach anyway. Now I have tried to go one layer deeper and simply look for the first subelement with the code a with the following code:

for child in record.findall("{http://www.loc.gov/MARC21/slim}datafield[@tag='082']"):
    for subelement in child.find("{http://www.loc.gov/MARC21/slim}subfield[@code='a']"):
        if subelement: 
            ddc = subelement.text
            ddccoll.append(ddc)

However, this doesn't append anything: if I print the length of the list afterwards, it says "0". I have done the same for the authors and the IDs, and it works for those. I am trying to get this right so that afterwards I can create a DataFrame with authors, IDs, titles etc.

I am currently completely stuck at this: Is the path wrong? Is there another, simpler, better way of doing this?

ssp24
  • Please, always include the *full* traceback of exceptions you get. We don't know now exactly *what line* triggered the exception, or how Python got there. – Martijn Pieters Feb 18 '21 at 11:06
  • Are you using lxml by any chance, or is this the standard library `xml.etree.ElementTree` module? – Martijn Pieters Feb 18 '21 at 11:10
  • I also note that it doesn't appear that you included the code section that throws the exception. We can't really help with that part, we don't know what exactly you did. Can you please also include, in your XML example, some sample tags that show what you want to extract and what *extra* data is extracted you don't want? That would make this much closer to our requirements for an [mcve]. We want a) sample input, b) exactly what you see when you run your code (including tracebacks or wrong data) and c) exactly what you do want to produce. – Martijn Pieters Feb 18 '21 at 11:13
  • Finally, it _sounds_ as if an xpath expression that directly looks for the first `subfield` child element with `code="a"` inside a `datafield` element with `tag="082"` would suffice here. I can't remember if the standard library ElementTree implementation of XPath is sufficient for such an expression or if you should be using `lxml` instead, but I am not really willing to guess if my interpretation of your question is correct. An input example with connected (wrong) output would really be helpful in illustrating your issue. – Martijn Pieters Feb 18 '21 at 11:15
  • For the XPath expression, you can use `/path/to/parent[attribute selector]/child[attribute selector]` to limit your search to child elements whose parents match a specific attribute. I _think_ you can then use `[1]` to only get the first such child for every parent (given my reading of [How to select the first element with a specific attribute using XPath](https://stackoverflow.com/q/1006283), which asked for just one match). It may be that the stdlib ElementTree implementation is not up to this; in that case use [`lxml`](https://lxml.de/) and the `xpath()` method. – Martijn Pieters Feb 18 '21 at 11:26
  • First of all, thanks a lot for the quick answers! I will see if I can get it to work with lxml or by changing the path. I have also tried adding more information and the traceback above, if that helps. – ssp24 Feb 18 '21 at 11:31
  • `record.find()` will always return a *single element*. In this case you got `None` as the result; perhaps you are processing **multiple** documents and *some* don't have the tag present? – Martijn Pieters Feb 18 '21 at 11:35

1 Answer

I assume that you have read your XML with the following code:

import xml.etree.ElementTree as et

tree = et.parse('Input.xml')
root = tree.getroot()

To reach the elements you want, you can use the following code:

# Namespace dictionary
ns = {'slim': 'http://www.loc.gov/MARC21/slim'}
# Process "datafield" elements with the required "tag" attribute
for it in root.findall('.//slim:datafield[@tag="082"]', ns):
    print(f'{it.tag:10}, {it.attrib}')
    # Find the first child with "code" == "a"
    child = it.find('slim:*[@code="a"]', ns)
    if child is not None:  # Something found
        print(f'  {child.tag:10}, {child.attrib}, {child.text}')
    else:
        print('  Nothing found')

In the above sample I included only print statements for the elements found, but you can do anything you wish with them.
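As a side note on the two failed attempts in the question, here is a minimal sketch of what probably went wrong. The inline sample XML and the variable names are made up for illustration, and I am assuming the standard library `xml.etree.ElementTree` is in use. `find()` returns a single element or `None`, so looping over its result either raises the `TypeError` (no match) or walks the *children* of the matched element, and a `subfield` has no children.

```python
import xml.etree.ElementTree as et

NS = {'slim': 'http://www.loc.gov/MARC21/slim'}

# Hypothetical two-record sample: the second record has no 082 datafield
xml_text = '''<collection xmlns="http://www.loc.gov/MARC21/slim">
  <record>
    <datafield tag="082">
      <subfield code="a">a1</subfield>
      <subfield code="a">a2</subfield>
    </datafield>
  </record>
  <record>
    <datafield tag="100">
      <subfield code="a">no 082 here</subfield>
    </datafield>
  </record>
</collection>'''

root = et.fromstring(xml_text)
records = root.findall('slim:record', NS)

# find() gives back one Element, or None when nothing matches.
# Writing "for child in missing" would raise the TypeError from the question.
missing = records[1].find('slim:datafield[@tag="082"]', NS)

# A single path can step from the datafield down to its subfield
first_a = records[0].find('slim:datafield[@tag="082"]/slim:subfield[@code="a"]', NS)

# Iterating over an Element yields its child elements; a subfield has none,
# which is why the inner loop in the question never appended anything.
children_of_subfield = list(first_a)
```

So the result of `find()` should be tested with `is not None` and then used directly, rather than being looped over.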

Using the following source XML:

<?xml version="1.0" encoding="UTF-8"?>
<collection xmlns="http://www.loc.gov/MARC21/slim">
  <record type="Bibliographic">
    <leader>...</leader>
    <controlfield>...</controlfield>
    <datafield tag="082" id="1">
        <subfield code="a">a1</subfield>
        <subfield code="x">x1</subfield>
        <subfield code="a">a2</subfield>
    </datafield>
    <datafield tag="456" id="2">
        <subfield code="a">a3</subfield>
    </datafield>
    <datafield tag="082" id="3">
        <subfield code="a">a4</subfield>
        <subfield code="x">x2</subfield>
        <subfield code="a">a5</subfield>
    </datafield>
  </record>
  <record type="Bibliographic">
    <leader>...</leader>
    <controlfield>...</controlfield>
    <datafield tag="082" id="4">
        <subfield code="a">a6</subfield>
        <subfield code="x">x3</subfield>
        <subfield code="a">a7</subfield>
    </datafield>
    <datafield tag="456" id="5">
        <subfield code="a">a8</subfield>
    </datafield>
  </record>
</collection>

I got the following result:

{http://www.loc.gov/MARC21/slim}datafield, {'tag': '082', 'id': '1'}
  {http://www.loc.gov/MARC21/slim}subfield, {'code': 'a'}, a1
{http://www.loc.gov/MARC21/slim}datafield, {'tag': '082', 'id': '3'}
  {http://www.loc.gov/MARC21/slim}subfield, {'code': 'a'}, a4
{http://www.loc.gov/MARC21/slim}datafield, {'tag': '082', 'id': '4'}
  {http://www.loc.gov/MARC21/slim}subfield, {'code': 'a'}, a6
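
Note that the loop above gives one value per "082" datafield, so the first record in my sample contributes two values (a1 and a4). If the goal is exactly one value per record, so that the list lines up with the other columns of a DataFrame, a sketch along the following lines might do. The function name `first_ddc_per_record`, the `placeholder` parameter and the shortened sample XML are my own assumptions, not something from the question.

```python
import xml.etree.ElementTree as et

NS = {'slim': 'http://www.loc.gov/MARC21/slim'}
# Path from a record to the subfields "a" of its 082 datafields
PATH = 'slim:datafield[@tag="082"]/slim:subfield[@code="a"]'

def first_ddc_per_record(root, placeholder=None):
    """Return one entry per record: the text of the first 082 subfield "a",
    or the placeholder when the record has no such subfield."""
    values = []
    for record in root.findall('slim:record', NS):
        hit = record.find(PATH, NS)  # first match in document order, or None
        values.append(hit.text if hit is not None else placeholder)
    return values

# Hypothetical sample: two 082 fields, one 082 field, no 082 field
sample = '''<collection xmlns="http://www.loc.gov/MARC21/slim">
  <record>
    <datafield tag="082"><subfield code="a">a1</subfield><subfield code="a">a2</subfield></datafield>
    <datafield tag="082"><subfield code="a">a4</subfield></datafield>
  </record>
  <record>
    <datafield tag="082"><subfield code="x">x3</subfield><subfield code="a">a6</subfield></datafield>
  </record>
  <record>
    <datafield tag="456"><subfield code="a">a8</subfield></datafield>
  </record>
</collection>'''

ddccoll = first_ddc_per_record(et.fromstring(sample))
print(ddccoll)  # ['a1', 'a6', None]
```

Appending a placeholder for records without a match keeps the list the same length as the number of records, which matters if it later becomes a DataFrame column next to the authors and IDs.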
Valdi_Bo