I'm a bit new to XML and python. Below is a cut down version of a large XML file I'm trying to bring into python to eventually write into SQL Server db.
<?xml version="1.0" encoding="utf-8"?>
<MyOrgRefData:OrgRefData xmlns:MyOrgRefData="http://refdata.org/org/v2-0-0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://refdata.org/org/v2-0-0/MyOrgRefData.xsd">
<Manifest>
<Version value="2-0-0" />
<PublicationType value="Full" />
<PublicationSource value="TEST123" />
<PublicationDate value="2022-05-23" />
<PublicationSeqNum value="1659" />
<FileCreationDateTime value="2022-05-23T22:14:47" />
<RecordCount value="287654" />
<ContentDescription value="FullFile_20220523" />
<PrimaryRoleScope>
<PrimaryRole id="123" displayName="Free beer for me" />
<PrimaryRole id="456" displayName="Free air for you" />
</PrimaryRoleScope>
</Manifest>
<CodeSystems>
<CodeSystem name="OrganisationRecordClass" oid="1.2.3.4.5">
<concept id="RC2" code="2" displayName="World1" />
<concept id="RC1" code="1" displayName="World2" />
</CodeSystem>
<CodeSystem name="OrganisationRole" oid="5.4.7.8">
<concept id="B1ng0" code="179" displayName="BoomBastic" />
<concept id="R2D2a" code="180" displayName="Fantastic" />
</CodeSystem>
</CodeSystems>
</MyOrgRefData:OrgRefData>
I've tried with lxml, pandas.read_xml, xml.etree and I'm not able to understand how to get what I want.
Ideally I'd like to pull in Manifest into a dataframe ready to to send to SQL (pd.to_sql()). I would do the same with CodeSystems as well, but separately. (there are other sections but I cut them off to shorten)
For example, using pandas to read in, I can only get a column with the values in. But I would like to either have the tag (Version, PublicationType, PublicationSource etc) in a column by the side of the value, or have them as the column headers and the values pivoted across the row instead.
dataFolder = '/Some/directory'
df_bulk = pd.read_xml(
dataFolder+'Data_Full_20220523.xml',
xpath='//Manifest/*',
attrs_only=True ,
)
df_bulk.head()
This is the output I get:
inx | value |
---|---|
0 | 2-0-0 |
1 | Full |
2 | TEST123 |
3 | 2022-05-23 |
4 | 1659 |
5 | 2022-05-23T22:14:47 |
6 | 287654 |
7 | FullFile_20220523 |
Ideally I would like:
inx | value |
---|---|
Version | 2-0-0 |
PublicationType | Full |
PublicationSource | TEST123 |
PublicationDate | 2022-05-23 |
PublicationSeqNum | 1659 |
FileCreationDateTime | 2022-05-23T22:14:47 |
FileCreationDateTime | 287654 |
ContentDescription | FullFile_20220523 |
The eagle eyed among you will notice I've left out PrimaryRoleScope. I would ideally like to treat this separately in it's own dataframe as well. But I am unsure how to exclude it when pulling in the rest of the Manifest section.
Many thanks if you've read this far, even more thanks for any help.