Parse error for XML from URL response (text file) with HTML block at the start

Question

I'm trying to scrape a filing from the SEC's EDGAR database. I can download the text file with requests, but when I try to parse it with the code below I get a parse error. The same code works when I request a .xml url and not a .txt url. The URL returns the following content:

<SEC-HEADER>0001752724-20-203989.hdr.sgml : 20201001
<ACCEPTANCE-DATETIME>20201001132951
ACCESSION NUMBER:       0001752724-20-203989
CONFORMED SUBMISSION TYPE:  NPORT-P
PUBLIC DOCUMENT COUNT:      2
CONFORMED PERIOD OF REPORT: 20200831
FILED AS OF DATE:       20201001
PERIOD START:               20201130

-------------
**
-------------
    FORMER COMPANY: 
        FORMER CONFORMED NAME:  ASA LTD
        DATE OF NAME CHANGE:    20070301

    FORMER COMPANY: 
        FORMER CONFORMED NAME:  ASA BERMUDA LTD
        DATE OF NAME CHANGE:    20030505
</SEC-HEADER>
<DOCUMENT>
<TYPE>NPORT-P
<SEQUENCE>1
<FILENAME>primary_doc.xml
<TEXT>
<XML>
<?xml version="1.0" encoding="UTF-8"?><edgarSubmission xmlns="http://www.sec.gov/edgar/nport" xmlns:com="http://www.sec.gov/edgar/common" xmlns:ncom="http://www.sec.gov/edgar/nportcommon" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.sec.gov/edgar/nport eis_NPORT_Filer.xsd">
  <headerData>
    <submissionType>NPORT-P</submissionType>
    <isConfidential>false</isConfidential>
    <filerInfo>

      <filer>
        <issuerCredentials>
          <cik>0001230869</cik>
          <ccc>XXXXXXXX</ccc>

My code:

import requests
import xml.etree.ElementTree as ET

url = 'https://www.sec.gov/Archives/edgar/data/1230869/0001752724-20-203989.txt'
response = requests.get(url)
root = ET.fromstring(response.content)

Error:

Traceback (most recent call last):

  File "/usr/local/anaconda/lib/python3.6/site-packages/IPython/core/interactiveshell.py", line 3326, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)

  File "<ipython-input-83-cd4e6ed59b34>", line 3, in <module>
    root = ET.fromstring(response.content)

  File "/usr/local/anaconda/lib/python3.6/xml/etree/ElementTree.py", line 1314, in XML
    parser.feed(text)

  File "<string>", line unknown
ParseError: not well-formed (invalid token): line 14, column 38

*The same code works when I request a .xml url and not a .txt url.* So, you're surprised that when you ask for a text file, it cannot be parsed as an XML file? — kjhughes, Oct 04 '20 at 03:51
I want it to work with the text version, since not all URLs have an .xml version available. Please refer to the question for more info; there is no element of surprise here. — drew_psy, Oct 04 '20 at 04:20
You cannot use XML tools or parsers on data that's not XML. What you've posted is not XML. (That's what *ParseError: not well-formed (invalid token)* is telling you.) You might be able to scan to the XML declaration `<?xml ...?>` and from there to the end of the file, you might be able to extract a well-formed XML file (but we cannot say for sure as you've only posted a portion of the top of the data). — kjhughes, Oct 04 '20 at 04:34
That actually worked. I was able to extract the XML part between the `<XML>` and `</XML>` tags and load it into the XML parser. — drew_psy, Oct 04 '20 at 05:03
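
For readers landing here, the extraction described in the comments can be sketched roughly as follows. This is a minimal sketch, not the asker's actual code: the helper name `extract_xml_blocks` is hypothetical, the sample string is an abbreviated stand-in for the filing, and it assumes each embedded document is closed by a matching `</XML>` tag (the excerpt in the question only shows the opening one).

```python
import re
import xml.etree.ElementTree as ET


def extract_xml_blocks(filing_text):
    """Return every payload found between <XML> and </XML> wrapper tags.

    Hypothetical helper: assumes the wrapper tags are spelled exactly
    like this and that each <XML> has a matching </XML>.
    """
    pattern = re.compile(r"<XML>\s*(.*?)\s*</XML>", re.DOTALL)
    return [match.group(1) for match in pattern.finditer(filing_text)]


# Abbreviated stand-in for the .txt filing; the real file is much longer.
sample = """<SEC-HEADER>0001752724-20-203989.hdr.sgml : 20201001
<ACCEPTANCE-DATETIME>20201001132951
</SEC-HEADER>
<DOCUMENT>
<TYPE>NPORT-P
<TEXT>
<XML>
<?xml version="1.0" encoding="UTF-8"?><edgarSubmission xmlns="http://www.sec.gov/edgar/nport">
  <headerData>
    <submissionType>NPORT-P</submissionType>
  </headerData>
</edgarSubmission>
</XML>
</TEXT>
</DOCUMENT>
"""

blocks = extract_xml_blocks(sample)

# Encode to bytes so the parser honours the declared UTF-8 encoding.
root = ET.fromstring(blocks[0].encode("utf-8"))

# The filing uses a default namespace, so lookups need a prefix mapping.
ns = {"n": "http://www.sec.gov/edgar/nport"}
submission_type = root.find("n:headerData/n:submissionType", ns).text
```

With the code from the question, `response.text` would be passed in place of `sample`. The header says the filing contains two public documents, so the list may hold more than one block; this sketch only parses the first.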
