The XML file contains bibliographic data in MARC21-XML (a format widely used by libraries). Its structure basically looks like this:
<?xml version="1.0" encoding="UTF-8"?><collection xmlns="http://www.loc.gov/MARC21/slim">
<record type="Bibliographic">
<leader> ... </leader>
<controlfield> ... </controlfield>
...
<controlfield> ... </controlfield>
<datafield tag="123" ... >
<subfield code="x"> ... </subfield>
...
<subfield code="x"> ... </subfield>
</datafield>
<datafield tag="456" ...>
There's a proper example file here: https://www.loc.gov/standards/marcxml/Sandburg/sandburg.xml. However, it only represents one item (e.g. a specific book). Usually these files contain hundreds to thousands of records, so the record tag with all its content is repeatable.
The file I work with has over 10,000 record tags in it (all representing different items), all of which have a datafield with the tag "082" and then several subfields.
I am now trying to extract the text in the subfield with `code="a"`. However, since this field is also repeatable and some records have two of those, I only ever want the first one. My current code, which extracts the text for ALL subfields with `code="a"` in these datafields, looks like this:
    for child in record.findall("{http://www.loc.gov/MARC21/slim}datafield[@tag='082']"):
        for subelement in child:
            if subelement.attrib['code'] == "a":
                ddc = subelement.text
                ddccoll.append(ddc)
This works but, as I said, returns too many elements: when I run it and print the length of my list, I get 10277, although the file only contains 10123 records. So there are a few too many, probably because the field is repeatable.
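To check whether repeated fields really explain the surplus, the subfields can be counted per record. This is only a minimal sketch on a made-up two-record sample (the `SAMPLE` string and the names `counts`, `repeated` and `missing` are illustrative assumptions, not part of my real data); on the real file, `root` would come from parsing the file instead.

```python
import xml.etree.ElementTree as ET

NS = "{http://www.loc.gov/MARC21/slim}"

# Made-up sample data: the second record repeats field 082.
SAMPLE = """<collection xmlns="http://www.loc.gov/MARC21/slim">
  <record>
    <datafield tag="082"><subfield code="a">811.52</subfield></datafield>
  </record>
  <record>
    <datafield tag="082"><subfield code="a">020</subfield></datafield>
    <datafield tag="082"><subfield code="a">025.3</subfield></datafield>
  </record>
</collection>"""

root = ET.fromstring(SAMPLE)

# Path from a record to every 082 subfield with code="a".
path = f"{NS}datafield[@tag='082']/{NS}subfield[@code='a']"

# Number of matching subfields in each record.
counts = [len(record.findall(path)) for record in root.iter(f"{NS}record")]

repeated = sum(1 for c in counts if c > 1)   # records with more than one match
missing = sum(1 for c in counts if c == 0)   # records with no match at all

print(sum(counts), len(counts), repeated, missing)
```

If `sum(counts)` is larger than `len(counts)`, the difference comes from records counted in `repeated`; a non-zero `missing` would mean that some records have no 082 $a at all.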
I tried using `find` instead of `findall`, but then I get a `TypeError`:
TypeError Traceback (most recent call last)
<ipython-input-23-fae786776bcf> in <module>
18 idcoll.append("nicht vorhanden")
19
---> 20 for child in record.find("{http://www.loc.gov/MARC21/slim}datafield[@tag='082']"):
21 for subelement in child:
22 if subelement.attrib['code'] == "a":
TypeError: 'NoneType' object is not iterable
I am not exactly sure why, since field 082 should be present in every single record. But since I am really after the subfield, this is probably not the right approach anyway. I then tried to go one level deeper and simply look for the first subfield with `code="a"`, using the following code:
    for child in record.findall("{http://www.loc.gov/MARC21/slim}datafield[@tag='082']"):
        for subelement in child.find("{http://www.loc.gov/MARC21/slim}subfield[@code='a']"):
            if subelement:
                ddc = subelement.text
                ddccoll.append(ddc)
However, this doesn't append anything: if I print the length of the list afterwards, it says "0". I have done the same for authors and IDs, and it works for those. I am trying to get this right so that afterwards I can create a DataFrame with authors, IDs, titles etc.
I am currently completely stuck at this: Is the path wrong? Is there another, simpler, better way of doing this?
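One possible direction, as a hedged sketch rather than a definitive solution: in `xml.etree.ElementTree`, `find` returns a single `Element` (the first match) or `None`, not a list. Looping over an `Element` iterates over its children, and a `subfield` has none, so the loop body never runs; an `Element` without children also evaluates as false in `if subelement:`. Looping over `None` raises the `TypeError`, which suggests that at least one record has no 082 field. The sketch below asks `find` for the whole path in one step and checks for `None`. The sample data and the helper name `first_ddc` are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Prefix-to-namespace mapping passed to find()/findall().
NS = {"marc": "http://www.loc.gov/MARC21/slim"}

# Made-up sample: one 082, two 082 fields, and no 082 at all.
SAMPLE = """<collection xmlns="http://www.loc.gov/MARC21/slim">
  <record>
    <datafield tag="082"><subfield code="a">811.52</subfield></datafield>
  </record>
  <record>
    <datafield tag="082"><subfield code="a">020</subfield></datafield>
    <datafield tag="082"><subfield code="a">025.3</subfield></datafield>
  </record>
  <record>
    <datafield tag="245"><subfield code="a">A title</subfield></datafield>
  </record>
</collection>"""


def first_ddc(record):
    """Return the text of the first 082 $a in a record, or None if absent."""
    # find() yields the first match in document order, or None if nothing matches.
    node = record.find("marc:datafield[@tag='082']/marc:subfield[@code='a']", NS)
    return node.text if node is not None else None


root = ET.fromstring(SAMPLE)

# Exactly one entry per record, so the list lines up with authors, IDs etc.
ddccoll = [first_ddc(record) for record in root.findall("marc:record", NS)]
print(ddccoll)  # ['811.52', '020', None]
```

Keeping one entry per record (with `None` as a placeholder) means the list has the same length as the other per-record lists, which matters when they are later combined into a DataFrame.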