2

I'm trying to load a large-ish (53MB) XML file into a pandas dataframe. Here are 3 rows of actual data (from public database of NTSB aviation accident reports), but the actual file has 77257 rows:

<?xml version="1.0"?>
<DATA xmlns="http://www.ntsb.gov">
<ROWS>
    <ROW EventId="20150901X74304" InvestigationType="Accident" AccidentNumber="GAA15CA244" EventDate="09/01/2015" Location="Truckee, CA" Country="United States" Latitude="" Longitude="" AirportCode="" AirportName="" InjurySeverity="" AircraftDamage="" AircraftCategory="" RegistrationNumber="N786AB" Make="JOE SALOMONE" Model="SUPER CUB SQ2" AmateurBuilt="" NumberOfEngines="" EngineType="" FARDescription="" Schedule="" PurposeOfFlight="" AirCarrier="" TotalFatalInjuries="" TotalSeriousInjuries="" TotalMinorInjuries="" TotalUninjured="" WeatherCondition="" BroadPhaseOfFlight="" ReportStatus="Preliminary" PublicationDate=""/>
    <ROW EventId="20150901X92332" InvestigationType="Accident" AccidentNumber="CEN15LA392" EventDate="08/31/2015" Location="Houston, TX" Country="United States" Latitude="29.809444" Longitude="-95.668889" AirportCode="IWS" AirportName="WEST HOUSTON" InjurySeverity="Non-Fatal" AircraftDamage="Substantial" AircraftCategory="Airplane" RegistrationNumber="N452CS" Make="CESSNA" Model="T240" AmateurBuilt="No" NumberOfEngines="" EngineType="" FARDescription="Part 91: General Aviation" Schedule="" PurposeOfFlight="Instructional" AirCarrier="" TotalFatalInjuries="" TotalSeriousInjuries="" TotalMinorInjuries="" TotalUninjured="2" WeatherCondition="VMC" BroadPhaseOfFlight="LANDING" ReportStatus="Preliminary" PublicationDate="09/04/2015"/>
    <ROW EventId="20150729X33718" InvestigationType="Accident" AccidentNumber="CEN15FA325" EventDate="" Location="Truth or Consequences, NM" Country="United States" Latitude="33.250556" Longitude="-107.293611" AirportCode="TCS" AirportName="TRUTH OR CONSEQUENCES MUNI" InjurySeverity="Fatal(2)" AircraftDamage="Substantial" AircraftCategory="Airplane" RegistrationNumber="N32401" Make="PIPER" Model="PA-28-151" AmateurBuilt="No" NumberOfEngines="1" EngineType="Reciprocating" FARDescription="Part 91: General Aviation" Schedule="" PurposeOfFlight="Personal" AirCarrier="" TotalFatalInjuries="2" TotalSeriousInjuries="" TotalMinorInjuries="" TotalUninjured="" WeatherCondition="" BroadPhaseOfFlight="UNKNOWN" ReportStatus="Preliminary" PublicationDate="08/10/2015"/>
</ROWS>
</DATA>

The following code, which I adapted from here, works, but it is very slow for this data (more than 30 mins on my system). I can't seem to get the solution posted for the original example to work because the structure of my XML is different. Is there a more efficient way to load this data?

path_to_xml_file = mypath

import pandas as pd
import xml.etree.ElementTree as ET

#Load xml file data
tree = ET.parse(path_to_xml_file)
root = tree.getroot()

#Grab list of column names
aviationdata_column_names = root[0][0].attrib.keys()             
#Create empty dataframe   
aviationdata_df = pd.DataFrame(columns=aviationdata_column_names)

#Loop through tree and append to dataframe
for i in range(0,len(root[0])-1):
    new_row = pd.Series(root[0][i].attrib)
    new_row.name = i
    aviationdata_df = aviationdata_df.append(new_row)

There are various solutions to similar questions posted around the internet (here, here, and here), but I haven't had luck implementing them. Version issues might be responsible for some of that (I am using Python 2.7).

Community
  • 1
  • 1
Edward
  • 107
  • 1
  • 11
  • @jezrael When I try the first 3 of those 4 methods, they return a dataframe with either 0 or 1 element. How does one make those methods loop through the xml file and pull the entire structure in? – Edward Jan 22 '17 at 19:54
  • @jezrael alas, this is an exercise to learn to work with various file formats. I am asked to use xml for this particular dataset. – Edward Jan 22 '17 at 20:05
  • @jezrael - I appreciate the link reference to my earlier question. And I doubt xml is too complicated for you! – Parfait Jan 22 '17 at 21:28
  • @Parfait - I want give suggestion for OP for add coment to your question, but I forget. Super, I upvote your answer. – jezrael Jan 22 '17 at 21:29
  • Does this answer your question? [How to convert an XML file to nice pandas dataframe?](https://stackoverflow.com/questions/28259301/how-to-convert-an-xml-file-to-nice-pandas-dataframe) – iacob Apr 21 '21 at 07:54

1 Answers1

6

Since your XML is attribute centric (no element values), consider iterating across all attributes which stores in dictionary key/value pairs in xml.etree.ElementTree.

Below binds lists of attribute set dictionaries to the DataFrame() call:

import pandas as pd
import xml.etree.ElementTree as ET

path_to_xml_file = mypath

# Load xml file data
tree = ET.parse(path_to_xml_file)

data = []
for el in tree.iterfind('./*'):
    for i in el.iterfind('*'):
        data.append(dict(i.items()))

df = pd.DataFrame(data)

Output

# FIRST FEW COLUMNS
print(df[list(range(12))])

#   AccidentNumber AirCarrier AircraftCategory AircraftDamage AirportCode                 AirportName AmateurBuilt BroadPhaseOfFlight        Country     EngineType   EventDate         EventId
# 0     GAA15CA244                                                                                                                     United States                 09/01/2015  20150901X74304
# 1     CEN15LA392                    Airplane    Substantial         IWS                WEST HOUSTON           No            LANDING  United States                 08/31/2015  20150901X92332
# 2     CEN15FA325                    Airplane    Substantial         TCS  TRUTH OR CONSEQUENCES MUNI           No            UNKNOWN  United States  Reciprocating              20150729X33718
Parfait
  • 104,375
  • 17
  • 94
  • 125
  • 1
    Thanks @Parfait. From >30 minutes to less than 1 second. Exactly what I hoped python was capable of. – Edward Jan 22 '17 at 21:33
  • 2
    Great! Turns out, you didn't even need the last `for` loop as attribute sets are dictionaries! See shorted code. – Parfait Jan 23 '17 at 17:35
  • How would you modify this for a file that has nested items? For example:
    12 Main Street
    – Korzak Aug 09 '19 at 19:56
  • this is good ... any idea on how to use it with an XML file both with elements and also a namespace? The method `i.find(.//{ns}key).text` does not work if I use the `.iterfind()` as suggested above ... – Lucas Aimaretto Apr 17 '20 at 20:55
  • 1
    @LucasAimaretto, all depends on XML which can vary quite a bit. So solution will need to be tailored. This solution is for a specific XML with mainly attributes. Please make an attempt and ask a new question. – Parfait Apr 17 '20 at 20:59
  • @Parfait, thanks for the reply. Indeed ... I do have an XML (w/namespace) with many elements and each of those with many attributes inside (the attributes are the same along elements). I'm parsing the file with a for-loop, but it can get slow sometimes if the file is big ... can't still find a better way of doing it to speed up the process ... – Lucas Aimaretto Apr 17 '20 at 21:02
  • Consider [XSLT to CSV](https://stackoverflow.com/a/61162362/1422451) requiring no loop. – Parfait Apr 17 '20 at 21:08