I want to convert an XML file to a DataFrame so I can do data analysis with pandas in my Python notebook. However, all the solutions I find on this website give me errors.
The following is a short version of the XML:
<EML xmlns="urn:oasis:names:tc:evs:schema:eml" xmlns:ns2="urn:oasis:names:tc:ciq:xsdschema:xAL:2.0" xmlns:ns3="urn:oasis:names:tc:ciq:xsdschema:xNL:2.0" xmlns:ns4="http://www.w3.org/2000/09/xmldsig#" xmlns:ns5="urn:oasis:names:tc:evs:schema:eml:ts" xmlns:ns6="http://www.kiesraad.nl/extensions" xmlns:ns7="http://www.kiesraad.nl/reportgenerator" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" Id="230b" SchemaVersion="5" xsi:schemaLocation="urn:oasis:names:tc:evs:schema:eml 230-candidatelist-v5-0.xsd http://www.kiesraad.nl/extensions kiesraad-eml-extensions.xsd">
  <!--
  Created by: Ondersteunende Software Verkiezingen by IVU Traffic Technologies AG, program: P2-3, version: 2.19.2
  -->
  <TransactionId>1</TransactionId>
  <ManagingAuthority>
    <AuthorityIdentifier Id="CSB">De Kiesraad</AuthorityIdentifier>
    <AuthorityAddress/>
  </ManagingAuthority>
  <IssueDate>2017-02-13</IssueDate>
  <ns6:CreationDateTime>2017-02-13T16:35:14.403+01:00</ns6:CreationDateTime>
  <ns4:CanonicalizationMethod Algorithm="http://www.w3.org/TR/2001/REC-xml-c14n-20010315#WithComments"/>
  <CandidateList>
    <Election>
      <ElectionIdentifier Id="TK2017">
        <ElectionName>Tweede Kamer der Staten-Generaal 2017</ElectionName>
        <ElectionCategory>TK</ElectionCategory>
        <ns6:ElectionSubcategory>TK</ns6:ElectionSubcategory>
        <ns6:ElectionDate>2017-03-15</ns6:ElectionDate>
        <ns6:NominationDate>2017-01-30</ns6:NominationDate>
      </ElectionIdentifier>
      <Contest>
        <ContestIdentifier Id="9">
          <ContestName>Amsterdam</ContestName>
        </ContestIdentifier>
        <Affiliation>
          <AffiliationIdentifier Id="1">
            <RegisteredName>VVD</RegisteredName>
          </AffiliationIdentifier>
          <Type>stel gelijkluidende lijsten</Type>
          <ns6:ListData BelongsToSet="1" PublicationLanguage="nl" PublishGender="true"/>
          <Candidate>
            <CandidateIdentifier Id="1"/>
            <CandidateFullName>
              <ns3:PersonName>
                <ns3:NameLine NameType="Initials">M.</ns3:NameLine>
                <ns3:FirstName>Mark</ns3:FirstName>
                <ns3:LastName>Rutte</ns3:LastName>
              </ns3:PersonName>
            </CandidateFullName>
            <Gender>male</Gender>
            <QualifyingAddress>
              <ns2:Locality>
                <ns2:LocalityName>'s-Gravenhage</ns2:LocalityName>
              </ns2:Locality>
            </QualifyingAddress>
          </Candidate>
          <Candidate>
            <CandidateIdentifier Id="2"/>
            <CandidateFullName>
              <ns3:PersonName>
                <ns3:NameLine NameType="Initials">J.A.</ns3:NameLine>
                <ns3:FirstName>Jeanine</ns3:FirstName>
                <ns3:LastName>Hennis-Plasschaert</ns3:LastName>
              </ns3:PersonName>
            </CandidateFullName>
            <Gender>female</Gender>
            <QualifyingAddress>
              <ns2:Locality>
                <ns2:LocalityName>Nederhorst den Berg</ns2:LocalityName>
              </ns2:Locality>
            </QualifyingAddress>
          </Candidate>
        </Affiliation>
      </Contest>
    </Election>
  </CandidateList>
</EML>
I want to process it with Python commands, all within my notebook kernel, and then run my pandas analysis on the result.
Data source: https://data.openstate.eu/dataset/kandidatenlijsten
This is the code I am using now (from http://www.austintaylor.io/lxml/python/pandas/xml/dataframe/2016/07/08/convert-xml-to-pandas-dataframe/ ):`
import xml.etree.ElementTree as ET
from lxml import etree
import pandas as pd

xml_data = 'Kandidatenlijsten_TK2017_Amsterdam.eml'

def xml2df(xml_data):
    tree = ET.parse(xml_data)
    root = tree.getroot()
    all_records = []
    headers = []
    for i, child in enumerate(root):
        record = []
        for subchild in child:
            record.append(subchild.text)
            if subchild.tag not in headers:
                headers.append(subchild.tag)
        all_records.append(record)
    return pd.DataFrame(all_records, columns=headers)`
This gives the error: AssertionError: 3 columns passed, passed data had 2 columns
Thank you kindly. Kind regards.
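For reference, the error above seems to come from the function only looking two levels deep: it collects three distinct tag names as headers across all children of the root, while the longest single record holds two values, so the column count and the data width disagree. The candidates sit much deeper in the tree, and every tag is namespace-qualified. Below is a minimal sketch of one possible approach, assuming one row per `Candidate` is the desired shape. The function name `candidates_to_df`, the prefixes in `NS`, and the column names are illustrative choices, not part of the EML format or of pandas, and the sample string is a trimmed copy of the XML shown above.

```python
import io
import xml.etree.ElementTree as ET

import pandas as pd

# Namespace URIs copied from the xmlns declarations on the <EML> root.
# The prefixes ("eml", "xal", "xnl") are arbitrary labels chosen here.
NS = {
    "eml": "urn:oasis:names:tc:evs:schema:eml",
    "xal": "urn:oasis:names:tc:ciq:xsdschema:xAL:2.0",
    "xnl": "urn:oasis:names:tc:ciq:xsdschema:xNL:2.0",
}


def candidates_to_df(source):
    """Return one row per <Candidate>; `source` is a file path or file object."""
    root = ET.parse(source).getroot()
    rows = []
    for contest in root.iterfind(".//eml:Contest", NS):
        contest_name = contest.findtext(
            "eml:ContestIdentifier/eml:ContestName", namespaces=NS)
        for aff in contest.iterfind("eml:Affiliation", NS):
            party = aff.findtext(
                "eml:AffiliationIdentifier/eml:RegisteredName", namespaces=NS)
            for cand in aff.iterfind("eml:Candidate", NS):
                ident = cand.find("eml:CandidateIdentifier", NS)
                rows.append({
                    "contest": contest_name,
                    "party": party,
                    "candidate_id": ident.get("Id") if ident is not None else None,
                    # findtext returns None when an element is missing
                    "initials": cand.findtext(".//xnl:NameLine", namespaces=NS),
                    "first_name": cand.findtext(".//xnl:FirstName", namespaces=NS),
                    "last_name": cand.findtext(".//xnl:LastName", namespaces=NS),
                    "gender": cand.findtext("eml:Gender", namespaces=NS),
                    "locality": cand.findtext(".//xal:LocalityName", namespaces=NS),
                })
    return pd.DataFrame(rows)


# Trimmed copy of the XML from the question, used only to demonstrate the call.
SAMPLE = """<EML xmlns="urn:oasis:names:tc:evs:schema:eml"
     xmlns:ns2="urn:oasis:names:tc:ciq:xsdschema:xAL:2.0"
     xmlns:ns3="urn:oasis:names:tc:ciq:xsdschema:xNL:2.0">
  <CandidateList><Election><Contest>
    <ContestIdentifier Id="9"><ContestName>Amsterdam</ContestName></ContestIdentifier>
    <Affiliation>
      <AffiliationIdentifier Id="1"><RegisteredName>VVD</RegisteredName></AffiliationIdentifier>
      <Candidate>
        <CandidateIdentifier Id="1"/>
        <CandidateFullName><ns3:PersonName>
          <ns3:NameLine NameType="Initials">M.</ns3:NameLine>
          <ns3:FirstName>Mark</ns3:FirstName>
          <ns3:LastName>Rutte</ns3:LastName>
        </ns3:PersonName></CandidateFullName>
        <Gender>male</Gender>
        <QualifyingAddress><ns2:Locality>
          <ns2:LocalityName>'s-Gravenhage</ns2:LocalityName>
        </ns2:Locality></QualifyingAddress>
      </Candidate>
    </Affiliation>
  </Contest></Election></CandidateList>
</EML>"""

df = candidates_to_df(io.StringIO(SAMPLE))
print(df)
```

For the real file, the call would presumably be `candidates_to_df('Kandidatenlijsten_TK2017_Amsterdam.eml')`. Building a list of dicts rather than parallel lists of values and headers means a missing element simply becomes an empty cell instead of shifting the columns.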