I want to convert an XML file to a DataFrame so I can do data analysis with pandas in my Python notebook. However, all the solutions I have found on this website give me errors.

The following is a short version of the XML:

    <EML xmlns="urn:oasis:names:tc:evs:schema:eml" xmlns:ns2="urn:oasis:names:tc:ciq:xsdschema:xAL:2.0" xmlns:ns3="urn:oasis:names:tc:ciq:xsdschema:xNL:2.0" xmlns:ns4="http://www.w3.org/2000/09/xmldsig#" xmlns:ns5="urn:oasis:names:tc:evs:schema:eml:ts" xmlns:ns6="http://www.kiesraad.nl/extensions" xmlns:ns7="http://www.kiesraad.nl/reportgenerator" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" Id="230b" SchemaVersion="5" xsi:schemaLocation="urn:oasis:names:tc:evs:schema:eml 230-candidatelist-v5-0.xsd http://www.kiesraad.nl/extensions kiesraad-eml-extensions.xsd">
<!--
Created by: Ondersteunende Software Verkiezingen by IVU Traffic Technologies AG, program: P2-3, version: 2.19.2
-->
<TransactionId>1</TransactionId>
<ManagingAuthority>
<AuthorityIdentifier Id="CSB">De Kiesraad</AuthorityIdentifier>
<AuthorityAddress/>
</ManagingAuthority>
<IssueDate>2017-02-13</IssueDate>
<ns6:CreationDateTime>2017-02-13T16:35:14.403+01:00</ns6:CreationDateTime>
<ns4:CanonicalizationMethod Algorithm="http://www.w3.org/TR/2001/REC-xml-c14n-20010315#WithComments"/>
<CandidateList>
<Election>
<ElectionIdentifier Id="TK2017">
<ElectionName>Tweede Kamer der Staten-Generaal 2017</ElectionName>
<ElectionCategory>TK</ElectionCategory>
<ns6:ElectionSubcategory>TK</ns6:ElectionSubcategory>
<ns6:ElectionDate>2017-03-15</ns6:ElectionDate>
<ns6:NominationDate>2017-01-30</ns6:NominationDate>
</ElectionIdentifier>
<Contest>
<ContestIdentifier Id="9">
<ContestName>Amsterdam</ContestName>
</ContestIdentifier>
<Affiliation>
<AffiliationIdentifier Id="1">
<RegisteredName>VVD</RegisteredName>
</AffiliationIdentifier>
<Type>stel gelijkluidende lijsten</Type>
<ns6:ListData BelongsToSet="1" PublicationLanguage="nl" PublishGender="true"/>
<Candidate>
<CandidateIdentifier Id="1"/>
<CandidateFullName>
<ns3:PersonName>
<ns3:NameLine NameType="Initials">M.</ns3:NameLine>
<ns3:FirstName>Mark</ns3:FirstName>
<ns3:LastName>Rutte</ns3:LastName>
</ns3:PersonName>
</CandidateFullName>
<Gender>male</Gender>
<QualifyingAddress>
<ns2:Locality>
<ns2:LocalityName>'s-Gravenhage</ns2:LocalityName>
</ns2:Locality>
</QualifyingAddress>
</Candidate>
<Candidate>
<CandidateIdentifier Id="2"/>
<CandidateFullName>
<ns3:PersonName>
<ns3:NameLine NameType="Initials">J.A.</ns3:NameLine>
<ns3:FirstName>Jeanine</ns3:FirstName>
<ns3:LastName>Hennis-Plasschaert</ns3:LastName>
</ns3:PersonName>
</CandidateFullName>
<Gender>female</Gender>
<QualifyingAddress>
<ns2:Locality>
<ns2:LocalityName>Nederhorst den Berg</ns2:LocalityName>
</ns2:Locality>
</QualifyingAddress>
</Candidate>
</Affiliation>
</Contest>
</Election>
</CandidateList>
</EML>

I want to process it with Python commands, all within my notebook kernel, and then be able to analyze the result with pandas.

Data source: https://data.openstate.eu/dataset/kandidatenlijsten

This is the code I am using now, based on http://www.austintaylor.io/lxml/python/pandas/xml/dataframe/2016/07/08/convert-xml-to-pandas-dataframe/:

import xml.etree.ElementTree as ET
from lxml import etree
import pandas as pd

xml_data = 'Kandidatenlijsten_TK2017_Amsterdam.eml'

def xml2df(xml_data):
    tree = ET.parse(xml_data)
    root = tree.getroot()
    all_records = []
    headers = []
    for i, child in enumerate(root):
        record = []
        for subchild in child:
            record.append(subchild.text)
            if subchild.tag not in headers:
                headers.append(subchild.tag)
        all_records.append(record)
    return pd.DataFrame(all_records, columns=headers)

This gives the error: AssertionError: 3 columns passed, passed data had 2 columns
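
One way to see where such a mismatch could come from is to compare how many distinct header tags the nested loop collects with how wide each record is. The snippet below is only a diagnostic sketch: `SAMPLE` is a trimmed, hypothetical stand-in for the EML file, and `inspect_records` is a helper name invented for illustration. It mirrors the two loops of `xml2df` without building a DataFrame.

```python
import xml.etree.ElementTree as ET

# Trimmed, hypothetical stand-in for the EML file (not the full document).
SAMPLE = """<EML xmlns="urn:oasis:names:tc:evs:schema:eml">
  <TransactionId>1</TransactionId>
  <ManagingAuthority>
    <AuthorityIdentifier Id="CSB">De Kiesraad</AuthorityIdentifier>
    <AuthorityAddress/>
  </ManagingAuthority>
  <IssueDate>2017-02-13</IssueDate>
  <CandidateList>
    <Election/>
  </CandidateList>
</EML>"""


def inspect_records(xml_text):
    """Return (headers, row_lengths) the way the loops in xml2df build them."""
    root = ET.fromstring(xml_text)
    headers, row_lengths = [], []
    for child in root:
        # len(child) is the number of direct sub-elements, i.e. the row width.
        row_lengths.append(len(child))
        for subchild in child:
            if subchild.tag not in headers:
                headers.append(subchild.tag)
    return headers, row_lengths


headers, row_lengths = inspect_records(SAMPLE)
print(len(headers), "headers:", headers)
print("row lengths:", row_lengths)
```

On this sample the headers accumulate across all top-level elements, while no single record is as wide as the header list, which would be consistent with the "3 columns passed, passed data had 2 columns" message.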

Thank you kindly, and kind regards,

Martijn
  • So what is the code you are using, and what errors does it give you? – patrick May 28 '17 at 19:28
  • I am using this code: http://www.austintaylor.io/lxml/python/pandas/xml/dataframe/2016/07/08/convert-xml-to-pandas-dataframe/ The error I get is: AssertionError: 3 columns passed, passed data had 2 columns – Martijn May 28 '17 at 19:35
  • I mean the actual code you are using, not the tutorial you are following. You can paste it into your question, using the code formatting. – patrick May 28 '17 at 19:38
  • I added it now! – Martijn May 28 '17 at 19:44
  • @Martijn, can you post a link to "Kandidatenlijsten_TK2017_Amsterdam.eml"? – MaxU - stand with Ukraine May 28 '17 at 19:53
  • How many things are in the variable `headers`? Can you check / print it out? – patrick May 28 '17 at 19:53
  • @MaxU https://data.openstate.eu/dataset/kandidatenlijsten/resource/dd639f9f-4035-4e17-b09b-33fc388e9780 – Martijn May 28 '17 at 20:12
  • @patrick It's empty and then gives error right away – Martijn May 28 '17 at 20:27
  • Indeed. Probably the loops don't work out for you. I suggest adding a print statement to every loop so you can see which one gets executed and which one doesn't. For instance, add after `for i, child` a `print ("i is", i)` and `print ("child is", child)` etc. That will help you debug your script. – patrick May 28 '17 at 20:49
  • @Martijn, any update on this question? Did you manage to make the code work, or what other approach did you use? –  Nov 18 '19 at 15:00
  • Does this answer your question? [How to convert an XML file to nice pandas dataframe?](https://stackoverflow.com/questions/28259301/how-to-convert-an-xml-file-to-nice-pandas-dataframe) – iacob Mar 25 '21 at 08:33
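
For the XML shown in the question, a namespace-aware walk over the `Candidate` elements may be a more direct route than looping over the root's children. The following is a minimal sketch, not a tested solution for the full file: the prefixes in `NS`, the function name `candidate_rows`, the output keys, and the trimmed `SAMPLE` are all assumptions made for illustration. Only the namespace URIs and element names come from the XML above.

```python
import xml.etree.ElementTree as ET

# Prefixes are arbitrary local names; only the URIs must match the document.
NS = {
    "eml": "urn:oasis:names:tc:evs:schema:eml",
    "xal": "urn:oasis:names:tc:ciq:xsdschema:xAL:2.0",
    "xnl": "urn:oasis:names:tc:ciq:xsdschema:xNL:2.0",
}

# Trimmed, hypothetical excerpt of the file from the question.
SAMPLE = """<EML xmlns="urn:oasis:names:tc:evs:schema:eml"
     xmlns:ns2="urn:oasis:names:tc:ciq:xsdschema:xAL:2.0"
     xmlns:ns3="urn:oasis:names:tc:ciq:xsdschema:xNL:2.0">
 <CandidateList><Election><Contest>
  <Affiliation>
   <AffiliationIdentifier Id="1"><RegisteredName>VVD</RegisteredName></AffiliationIdentifier>
   <Candidate>
    <CandidateIdentifier Id="1"/>
    <CandidateFullName><ns3:PersonName>
     <ns3:FirstName>Mark</ns3:FirstName><ns3:LastName>Rutte</ns3:LastName>
    </ns3:PersonName></CandidateFullName>
    <Gender>male</Gender>
    <QualifyingAddress><ns2:Locality>
     <ns2:LocalityName>'s-Gravenhage</ns2:LocalityName>
    </ns2:Locality></QualifyingAddress>
   </Candidate>
   <Candidate>
    <CandidateIdentifier Id="2"/>
    <CandidateFullName><ns3:PersonName>
     <ns3:FirstName>Jeanine</ns3:FirstName><ns3:LastName>Hennis-Plasschaert</ns3:LastName>
    </ns3:PersonName></CandidateFullName>
    <Gender>female</Gender>
    <QualifyingAddress><ns2:Locality>
     <ns2:LocalityName>Nederhorst den Berg</ns2:LocalityName>
    </ns2:Locality></QualifyingAddress>
   </Candidate>
  </Affiliation>
 </Contest></Election></CandidateList>
</EML>"""


def candidate_rows(root):
    """Build one flat dict per Candidate element, tagged with its party name."""
    rows = []
    for aff in root.iterfind(".//eml:Affiliation", NS):
        party = aff.findtext(
            "eml:AffiliationIdentifier/eml:RegisteredName", namespaces=NS
        )
        for cand in aff.iterfind("eml:Candidate", NS):
            ident = cand.find("eml:CandidateIdentifier", NS)
            rows.append({
                "party": party,
                "candidate_id": ident.get("Id") if ident is not None else None,
                "first_name": cand.findtext(".//xnl:FirstName", namespaces=NS),
                "last_name": cand.findtext(".//xnl:LastName", namespaces=NS),
                "gender": cand.findtext("eml:Gender", namespaces=NS),
                "locality": cand.findtext(".//xal:LocalityName", namespaces=NS),
            })
    return rows


rows = candidate_rows(ET.fromstring(SAMPLE))
for row in rows:
    print(row)
```

Because every dict has the same keys, passing the list to `pd.DataFrame(rows)` should give one row per candidate. For the real file, `ET.parse(path).getroot()` would take the place of `ET.fromstring(SAMPLE)`.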
