2

I have an XBRL document, which should be an XML document.

I am trying to extract tags grouped by their namespace. The code works for some namespaces (us-gaap) but fails for others (xbrli), even though the XML file contains plenty of tags of the form <xbrli:...>.

Code:

from bs4 import BeautifulSoup

with open('test.xml', 'r') as fp:
    raw_text = fp.read()

soup = BeautifulSoup(raw_text, 'xml')

print(len(soup.find_all(lambda tag: tag.prefix == 'us-gaap')))  # prints 941
print(len(soup.find_all(lambda tag: tag.prefix == 'xbrli')))  # prints 0

You can find the test.xml file here.

Nazim Kerimbekov
user1315621

2 Answers

1

Can you try this code, which uses CSS selectors? With your code I sometimes get 1268 xbrli tags and sometimes 0 (tested on an old version, bs4==4.4.1). Also, which version of BeautifulSoup are you using?

from bs4 import BeautifulSoup, __version__

# Read the file inside a context manager so the handle is closed promptly.
with open('data.txt', 'r') as fp:
    soup = BeautifulSoup(fp.read(), 'xml')

print('xbrli:* tags =', len(soup.select('xbrli|*')))
print('us-gaap:* tags =', len(soup.select('us-gaap|*')))

print('Version of bs4:', __version__)

Prints:

xbrli:* tags = 1268
us-gaap:* tags = 941
Version of bs4: 4.8.1
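
As a cross-check that does not depend on the installed bs4 version, the tags can also be counted per namespace with the standard library. This is only a sketch using a swapped-in technique (`xml.etree.ElementTree`, not BeautifulSoup): the `SAMPLE` document and the helper name `count_tags_by_namespace` are made up for illustration, and the us-gaap URI is just an example, so check the `xmlns:` declarations in your own file.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical miniature instance document, not the OP's test.xml.
SAMPLE = """<xbrli:xbrl xmlns:xbrli="http://www.xbrl.org/2003/instance"
                        xmlns:us-gaap="http://fasb.org/us-gaap/2019-01-31">
  <xbrli:context id="c1">
    <xbrli:entity>
      <xbrli:identifier scheme="http://example.com">0000000000</xbrli:identifier>
    </xbrli:entity>
  </xbrli:context>
  <us-gaap:Assets contextRef="c1">100</us-gaap:Assets>
  <us-gaap:Liabilities contextRef="c1">40</us-gaap:Liabilities>
</xbrli:xbrl>"""


def count_tags_by_namespace(xml_text):
    """Return a Counter mapping namespace URI -> number of elements."""
    root = ET.fromstring(xml_text)
    counts = Counter()
    for element in root.iter():
        tag = element.tag
        # ElementTree spells qualified names as "{namespace-uri}localname".
        if isinstance(tag, str) and tag.startswith("{"):
            uri = tag[1:].split("}", 1)[0]
        else:
            uri = ""  # element without a namespace
        counts[uri] += 1
    return counts


if __name__ == "__main__":
    for uri, n in count_tags_by_namespace(SAMPLE).items():
        print(uri, n)
```

Note that ElementTree groups by namespace URI rather than by prefix, so two prefixes bound to the same URI are counted together. If these counts match what `soup.select('xbrli|*')` reports, the parsing side is fine and the 0 comes from the old bs4 version.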
Andrej Kesely
  • I also thought a CSS selector would be the best option. However, since I couldn't reproduce the problem after running the code a few times, I left it out. Nice thought, though. +1 – KunduK Dec 24 '19 at 16:24
  • @KunduK I've tested the OP's code on an old version of bs4 (`4.4.1`), and indeed, the count of xbrli tags is (sometimes) 0. So I presume the OP is using an old version of bs4. – Andrej Kesely Dec 24 '19 at 16:27
  • CSS namespace selector support wasn't added until 4.7.0 with the addition of soupsieve. – facelessuser Dec 24 '19 at 20:32
0

Upgrading to BeautifulSoup 4.8.1 solved the issue.
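
For anyone who wants to guard against this in a script, here is a rough sketch of a version check. The 4.7.0 threshold comes from the comment above about CSS namespace selectors; it says nothing about when the `tag.prefix` behaviour changed. The helper names `parse_version` and `supports_namespace_selectors` are hypothetical, not part of bs4.

```python
# Minimum bs4 version with CSS namespace selectors, per the comment thread.
MIN_FOR_NAMESPACE_SELECTORS = (4, 7, 0)


def parse_version(version):
    """Turn '4.8.1' into (4, 8, 1); a suffix such as 'b2' is ignored."""
    parts = []
    for piece in version.split("."):
        digits = ""
        for ch in piece:
            if ch.isdigit():
                digits += ch
            else:
                break
        parts.append(int(digits) if digits else 0)
    # Pad or trim to three components so '4.7' compares like '4.7.0'.
    return tuple((parts + [0, 0, 0])[:3])


def supports_namespace_selectors(version):
    """True if this bs4 version string is at least 4.7.0."""
    return parse_version(version) >= MIN_FOR_NAMESPACE_SELECTORS


if __name__ == "__main__":
    try:
        import bs4
        print(bs4.__version__, supports_namespace_selectors(bs4.__version__))
    except ImportError:
        print("bs4 is not installed")
```

If the check fails, `pip install --upgrade beautifulsoup4` should bring in a recent enough release.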

user1315621