I have been using the pandas_read_xml package to read XML files into pandas DataFrames. Lately, however, the XML parser has started failing intermittently: the same call raises an error on one attempt and succeeds on the next. I am puzzled by this and hope someone can help me understand it. The problem is illustrated below.

  import pandas as pd
  import pandas_read_xml as pdx

  data = pdx.read_xml('https://www.sec.gov/Archives/edgar/data/1000351/000114554921012283/primary_doc.xml', ['edgarSubmission'])

This occasionally raises “ExpatError: mismatched tag: line 50, column 124”, yet the same call works fine on a repeated attempt. Other URLs behave the same way. As far as I can tell, nothing is wrong with the XML file itself. The traceback contains the following:

  File "<ipython-input-118-c68fdb3a2633>", line 1, in <module>
    data = pdx.read_xml('https://www.sec.gov/Archives/edgar/data/1002537/000114554921006264/primary_doc.xml',['edgarSubmission'])

  File "C:\Users\A1610222\AppData\Local\Continuum\anaconda2\lib\site-packages\pandas_read_xml.py", line 33, in read_xml
    return read_xml_as_dataframe(read_xml_from_url(path_or_xml), root_key_list, root_is_rows=root_is_rows, transpose=transpose)

  File "C:\Users\A1610222\AppData\Local\Continuum\anaconda2\lib\site-packages\pandas_read_xml.py", line 62, in read_xml_as_dataframe
    return pd.DataFrame([get_to_root_in_dict(xmltodict.parse(xml), root_key_list)])

  File "C:\Users\A1610222\AppData\Local\Continuum\anaconda2\lib\site-packages\xmltodict.py", line 327, in parse
    parser.Parse(xml_input, True)

  ExpatError: mismatched tag: line 50, column 124

The traceback points to lines 33 and 62 of pandas_read_xml. I have uninstalled and reinstalled the package to rule out a broken installation, but the problem persists. Please excuse me if I am missing something elementary, and let me know if anything is unclear. Thanks in advance for any help.

stump
    Not familiar with the package; however, I know @Parfait is working on a `read_xml` capability directly in pandas. In the meantime, why not read the file with lxml and then create the dataframe? You could share the expected output dataframe if you need assistance from the community. – sammywemmy Apr 22 '21 at 11:22
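A minimal sketch of the idea in the comment above: parse the XML yourself, flatten it into a record, and build the dataframe from that. It uses the standard library's `xml.etree.ElementTree` in place of lxml (an assumption made to keep the example dependency-free), and both the `flatten` helper and the `SAMPLE` document are hypothetical; the real primary_doc.xml has a different layout.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for the downloaded document, not the real SEC file.
SAMPLE = """<edgarSubmission>
  <headerData><submissionType>X</submissionType></headerData>
  <formData><name>Example</name></formData>
</edgarSubmission>"""


def flatten(element, prefix=""):
    """Map every leaf element below `element` to a 'parent.child' key."""
    record = {}
    for child in element:
        # Drop a leading '{namespace}' from the tag, if there is one.
        tag = child.tag.split("}", 1)[-1]
        key = prefix + tag
        if len(child):
            # The element has children of its own: recurse with a longer prefix.
            record.update(flatten(child, key + "."))
        else:
            record[key] = (child.text or "").strip()
    return record


record = flatten(ET.fromstring(SAMPLE))
```

With pandas installed, `pd.DataFrame([record])` turns the record into a one-row dataframe. Note that repeated sibling tags overwrite each other in this sketch, so a document with lists of elements would need a different flattening rule.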

1 Answer

I discovered today that the problem was caused by a connectivity issue and had nothing to do with the package or the structure of the XML files.
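One way to guard against this kind of failure is to separate the download from the parsing and to retry when the payload is not well-formed XML. The sketch below is an illustration under assumptions, not part of pandas_read_xml: the helper names `fetch_url` and `fetch_well_formed_xml`, the retry count, and the delay are all hypothetical, and it assumes Python 3. The likely mechanism, though the answer does not confirm it, is that an interrupted or failed download hands the parser incomplete or non-XML content, which Expat then reports as a mismatched tag.

```python
import http.client
import time
import urllib.request
import xml.etree.ElementTree as ET


def fetch_url(url, timeout=30):
    """Download `url` and return the raw response body as bytes."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read()


def fetch_well_formed_xml(url, fetch=fetch_url, attempts=3, delay=1.0):
    """Return the body of `url` once it parses as XML, retrying otherwise.

    `fetch` is injectable so the retry logic can be exercised offline.
    """
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            payload = fetch(url)
            # Raises ET.ParseError if the payload is truncated or not XML.
            ET.fromstring(payload)
            return payload
        except (OSError, http.client.HTTPException, ET.ParseError) as exc:
            # OSError covers URLError, HTTPError, and socket timeouts.
            last_error = exc
            if attempt < attempts:
                time.sleep(delay)
    raise RuntimeError(
        "no well-formed XML from %s after %d attempts" % (url, attempts)
    ) from last_error
```

The validated bytes can then be passed to a parser such as `xmltodict.parse`. If the final attempt still fails, the chained exception shows whether the network or the content was at fault, which makes a connectivity problem much easier to spot than a bare ExpatError.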

stump