Parsing XBRL with Python and regular expressions to find TextBlocks

Question

I'm using Python and ElementTree to access a list of .xml files scraped from EDGAR. I've read and re-read the ElementTree page on python.org and still don't understand how to drill down into the data. How am I supposed to use ElementTree to get something like the first TextBlock for each of the listed .xml files?

import re
import xml.etree.ElementTree as ET
from urllib.request import urlopen  # Python 3; on Python 2 use: from urllib2 import urlopen

full_xml = [
    'https://www.sec.gov/Archives/edgar/data/1593001/000121390017010242/ngtf-20170630.xml',
    'https://www.sec.gov/Archives/edgar/data/13573/000143774917016692/bwla-20170702.xml',
    'https://www.sec.gov/Archives/edgar/data/1652871/000165287117000030/none-20170630.xml',
    'https://www.sec.gov/Archives/edgar/data/1434674/000154972717000042/chnd-20170630_cal.xml',
    'https://www.sec.gov/Archives/edgar/data/1083922/000130841117000030/arao-20170331.xml',
]
for url in full_xml:
    # Download each file and parse it into an ElementTree
    response = urlopen(url)
    tree = ET.parse(response)
    root = tree.getroot()
    print(root)

Ghislain Fourny · Answer 1 · 2017-10-04T12:11:38.353

The information needed to find text blocks is not only in the XBRL instance (the main .xml file). It is also in the taxonomy schema files that belong to the DTS (Discoverable Taxonomy Set).
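As a rough illustration of where those schema files come from, the sketch below reads the link:schemaRef elements of an instance and resolves them against the instance URL. This is only the first hop of DTS discovery (imports and linkbases inside each schema would have to be followed as well). The instance string, the example.com URL and the schema_refs helper are made up for the example.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urljoin

LINK_NS = "http://www.xbrl.org/2003/linkbase"
XLINK_NS = "http://www.w3.org/1999/xlink"

# Hypothetical, heavily trimmed instance used only for this sketch.
INSTANCE = """<xbrli:xbrl xmlns:xbrli="http://www.xbrl.org/2003/instance"
    xmlns:link="http://www.xbrl.org/2003/linkbase"
    xmlns:xlink="http://www.w3.org/1999/xlink">
  <link:schemaRef xlink:type="simple" xlink:href="demo-20170630.xsd"/>
</xbrli:xbrl>"""


def schema_refs(instance_text, base_url):
    """Return absolute URLs of the schemas referenced directly by an instance."""
    root = ET.fromstring(instance_text)
    urls = []
    for ref in root.iter("{%s}schemaRef" % LINK_NS):
        href = ref.get("{%s}href" % XLINK_NS)
        if href:
            # Relative hrefs are resolved against the URL of the instance.
            urls.append(urljoin(base_url, href))
    return urls


refs = schema_refs(INSTANCE, "https://example.com/filings/demo-20170630.xml")
print(refs)
```

Each returned schema would then have to be fetched and scanned for further imports, which is where the work starts to pile up.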

Finding textblock facts at the level of XML would require:

constructing the DTS by resolving all links to schemas and linkbases from the instance
building a list of concepts gathered from all the schemas found, together with their metadata
filtering these concepts by type (you want those with type nonnum:textBlockItemType, using a namespace-sensitive comparison)
looking up the facts in the XBRL instance that are associated with a concept that made it through the above filter
potentially dealing with dimensions to only include dimensionless facts
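To give an idea of what the filtering, lookup and dimension steps above involve, here is a minimal sketch using only ElementTree on two tiny made-up documents. It assumes the DTS has already been resolved down to a single schema, that the nonnum prefix is bound to http://www.xbrl.org/dtr/type/non-numeric (check what your taxonomy actually imports), and that no concept uses a type derived from textBlockItemType. The helper names and the sample data are hypothetical, and prefix scoping is simplified to one flat map.

```python
import io
import xml.etree.ElementTree as ET

XSD_NS = "http://www.w3.org/2001/XMLSchema"
XBRLI_NS = "http://www.xbrl.org/2003/instance"
XBRLDI_NS = "http://xbrl.org/2006/xbrldi"
# Assumed namespace of the non-numeric types schema.
NONNUM_NS = "http://www.xbrl.org/dtr/type/non-numeric"

SCHEMA = """<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
    xmlns:nonnum="http://www.xbrl.org/dtr/type/non-numeric"
    xmlns:xbrli="http://www.xbrl.org/2003/instance"
    targetNamespace="http://example.com/demo">
  <xs:element name="PolicyTextBlock" type="nonnum:textBlockItemType"/>
  <xs:element name="Revenue" type="xbrli:monetaryItemType"/>
</xs:schema>"""

INSTANCE = """<xbrli:xbrl xmlns:xbrli="http://www.xbrl.org/2003/instance"
    xmlns:xbrldi="http://xbrl.org/2006/xbrldi"
    xmlns:demo="http://example.com/demo">
  <xbrli:context id="plain">
    <xbrli:entity><xbrli:identifier scheme="http://example.com">X</xbrli:identifier></xbrli:entity>
  </xbrli:context>
  <xbrli:context id="dim">
    <xbrli:entity>
      <xbrli:identifier scheme="http://example.com">X</xbrli:identifier>
      <xbrli:segment>
        <xbrldi:explicitMember dimension="demo:SegmentAxis">demo:RetailMember</xbrldi:explicitMember>
      </xbrli:segment>
    </xbrli:entity>
  </xbrli:context>
  <demo:PolicyTextBlock contextRef="plain">&lt;p&gt;Policy text&lt;/p&gt;</demo:PolicyTextBlock>
  <demo:PolicyTextBlock contextRef="dim">&lt;p&gt;Retail only&lt;/p&gt;</demo:PolicyTextBlock>
  <demo:Revenue contextRef="plain" unitRef="usd" decimals="0">1000</demo:Revenue>
</xbrli:xbrl>"""


def text_block_concepts(schema_text):
    """Return '{namespace}name' tags of concepts typed nonnum:textBlockItemType."""
    prefixes, target_ns, concepts = {}, None, set()
    source = io.BytesIO(schema_text.encode("utf-8"))
    for event, payload in ET.iterparse(source, events=("start-ns", "start")):
        if event == "start-ns":
            prefix, uri = payload  # remember prefix bindings to resolve QNames
            prefixes[prefix] = uri
        elif payload.tag == "{%s}schema" % XSD_NS:
            target_ns = payload.get("targetNamespace")
        elif payload.tag == "{%s}element" % XSD_NS:
            prefix, _, local = payload.get("type", "").rpartition(":")
            # Compare the resolved namespace, not the prefix string itself.
            if prefixes.get(prefix) == NONNUM_NS and local == "textBlockItemType":
                concepts.add("{%s}%s" % (target_ns, payload.get("name")))
    return concepts


def dimensionless_text_blocks(instance_text, concepts):
    """Return (tag, text) for text block facts whose context has no dimensions."""
    root = ET.fromstring(instance_text)
    dimensional = set()
    for ctx in root.iter("{%s}context" % XBRLI_NS):
        # Any xbrldi:* descendant (explicit or typed member) marks a dimension.
        if any(el.tag.startswith("{%s}" % XBRLDI_NS) for el in ctx.iter()):
            dimensional.add(ctx.get("id"))
    return [(fact.tag, fact.text) for fact in root
            if fact.tag in concepts and fact.get("contextRef") not in dimensional]


concepts = text_block_concepts(SCHEMA)
facts = dimensionless_text_blocks(INSTANCE, concepts)
print(facts)
```

Even this toy version skips DTS discovery, derived types, tuples and default dimension values, which is part of why doing it by hand gets error-prone quickly.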

This would be theoretically doable, but it would be very complex and resource-intensive to do at the level of XML, and error-prone -- even more so when using a library within an imperative language rather than a tool from the XML technology stack (such as XQuery). In fact, this amounts to reimplementing a (partial) XBRL processor, which is beyond what regular expressions can do.

In general, I strongly recommend using an existing XBRL processor -- there are open-source processors out there, and some may even be compatible with Python -- where the above logic is already implemented, so that it suffices to use an API (e.g., REST or Python) to browse through concepts, select text blocks, and look up the facts with the appropriate data model.

The XBRL technology stack is still in its early days, and many processors still do not deal with dimensions at the appropriate abstraction level, but if it continues gaining popularity, the number of products should increase, and they should become more complete and stable.

Ghislain, thank you for your in-depth response. Would you happen to have a favorite XBRL processor, or could you recommend an easily accessible open-source one? -- Derek_P, Oct 04 '17 at 15:05
An example of an open-source processor is Arelle. I can also mention some I have already used, such as the proprietary ReportingStandard and Fujitsu XBRL tools, but there are many others. This is a diverse ecosystem; features that XBRL software typically offers include a UI to browse or edit a taxonomy or instance, import/export from Excel, and programming APIs (Java, REST, ...). It is worth trying a tool out to see if it matches your needs. Further tools are listed at https://www.xbrl.org/the-standard/how/tools-and-services/ -- Ghislain Fourny, Jul 06 '18 at 09:31
