I'm trying to scrape a file from the SEC EDGAR database. I can download the text file using requests, but when I try to parse it with the following code I get a parse error. The same code works when I request a .xml URL instead of a .txt URL. The .txt URL returns the following content:

<SEC-HEADER>0001752724-20-203989.hdr.sgml : 20201001
<ACCEPTANCE-DATETIME>20201001132951
ACCESSION NUMBER:       0001752724-20-203989
CONFORMED SUBMISSION TYPE:  NPORT-P
PUBLIC DOCUMENT COUNT:      2
CONFORMED PERIOD OF REPORT: 20200831
FILED AS OF DATE:       20201001
PERIOD START:               20201130

-------------
**
-------------
    FORMER COMPANY: 
        FORMER CONFORMED NAME:  ASA LTD
        DATE OF NAME CHANGE:    20070301

    FORMER COMPANY: 
        FORMER CONFORMED NAME:  ASA BERMUDA LTD
        DATE OF NAME CHANGE:    20030505
</SEC-HEADER>
<DOCUMENT>
<TYPE>NPORT-P
<SEQUENCE>1
<FILENAME>primary_doc.xml
<TEXT>
<XML>
<?xml version="1.0" encoding="UTF-8"?><edgarSubmission xmlns="http://www.sec.gov/edgar/nport" xmlns:com="http://www.sec.gov/edgar/common" xmlns:ncom="http://www.sec.gov/edgar/nportcommon" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.sec.gov/edgar/nport eis_NPORT_Filer.xsd">
  <headerData>
    <submissionType>NPORT-P</submissionType>
    <isConfidential>false</isConfidential>
    <filerInfo>

      <filer>
        <issuerCredentials>
          <cik>0001230869</cik>
          <ccc>XXXXXXXX</ccc>

My code:

import requests
import xml.etree.ElementTree as ET

url = 'https://www.sec.gov/Archives/edgar/data/1230869/0001752724-20-203989.txt'
response = requests.get(url)
root = ET.fromstring(response.content)

Error:

Traceback (most recent call last):

  File "/usr/local/anaconda/lib/python3.6/site-packages/IPython/core/interactiveshell.py", line 3326, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)

  File "<ipython-input-83-cd4e6ed59b34>", line 3, in <module>
    root = ET.fromstring(response.content)

  File "/usr/local/anaconda/lib/python3.6/xml/etree/ElementTree.py", line 1314, in XML
    parser.feed(text)

  File "<string>", line unknown
ParseError: not well-formed (invalid token): line 14, column 38
drew_psy
  • *The same code works when I request a .xml url and not a .txt url.* So, you're surprised that when you ask for a text file, it cannot be parsed as an XML file? – kjhughes Oct 04 '20 at 03:51
  • I want it to work with the text version, since not all URLs have an .xml version available. Please refer to the question for more info; there is no element of surprise here. – drew_psy Oct 04 '20 at 04:20
  • You cannot use XML tools or parsers on data that's not XML. What you've posted is not XML. (That's what *ParseError: not well-formed (invalid token)* is telling you.) You might be able to scan to the XML declaration `<?xml version="1.0" encoding="UTF-8"?>`, and from there to the end of the file you might be able to extract a well-formed XML file (but we cannot say for sure, as you've only posted a portion of the top of the data). – kjhughes Oct 04 '20 at 04:34
  • That actually worked. I was able to extract the XML part between the `<XML>` tags and load it into the XML parser. – drew_psy Oct 04 '20 at 05:03
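
The approach described in the last two comments could look roughly like the sketch below. It is only an illustration: the helper name `extract_xml_documents` is hypothetical, and the sample text is an abbreviated stand-in for the filing that assumes each embedded document is closed by `</XML>`, which the excerpt in the question does not show. With a real download, `response.content` would be passed in place of `sample`.

```python
import re
import xml.etree.ElementTree as ET

# Matches everything between <XML> and </XML>, trimming surrounding whitespace
# so the XML declaration ends up at the very start of the captured bytes.
XML_BLOCK = re.compile(rb"<XML>\s*(.*?)\s*</XML>", re.DOTALL)


def extract_xml_documents(raw):
    """Return one parsed root element per <XML>...</XML> block in the raw bytes."""
    return [ET.fromstring(match.group(1)) for match in XML_BLOCK.finditer(raw)]


# Abbreviated, made-up stand-in for the .txt submission (closing tags are assumed).
sample = b"""<SEC-HEADER>0001752724-20-203989.hdr.sgml : 20201001
ACCESSION NUMBER:       0001752724-20-203989
</SEC-HEADER>
<DOCUMENT>
<TYPE>NPORT-P
<TEXT>
<XML>
<?xml version="1.0" encoding="UTF-8"?><edgarSubmission xmlns="http://www.sec.gov/edgar/nport">
  <headerData>
    <submissionType>NPORT-P</submissionType>
  </headerData>
</edgarSubmission>
</XML>
</TEXT>
</DOCUMENT>
"""

roots = extract_xml_documents(sample)

# The filing uses a default namespace, so lookups need a prefix mapping.
ns = {"n": "http://www.sec.gov/edgar/nport"}
submission_type = roots[0].find("n:headerData/n:submissionType", ns).text
print(submission_type)
```

Working on bytes rather than decoded text keeps the `encoding="UTF-8"` declaration valid for the parser. A submission with several embedded XML documents would yield one root element per block.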
