BeautifulSoup can't read all the namespaces

Question

I have an XBRL document, which should be an XML document.

I am trying to extract tags grouped by their namespace. The code appears to work for some namespaces (us-gaap) but seems to fail for others (xbrli), even though the XML file contains plenty of tags of the form <xbrli:...>.

Code:

from bs4 import BeautifulSoup

with open('test.xml', 'r') as fp:
    raw_text = fp.read()

soup = BeautifulSoup(raw_text, 'xml')

print(len(soup.find_all(lambda tag: tag.prefix == 'us-gaap')))  # prints 941
print(len(soup.find_all(lambda tag: tag.prefix == 'xbrli')))  # prints 0

You can find the test.xml file here.

I can't replicate this. Using your code, I get both values, `941` and `1268`. – KunduK Dec 24 '19 at 15:54

Andrej Kesely · Answer 1 · 2019-12-24T16:31:34.677


Can you try this code, which uses CSS selectors? With your code I sometimes get 1268 for the xbrli tags and sometimes 0 (tested on an old version, bs4==4.4.1). Also, which version of BeautifulSoup are you using?

from bs4 import BeautifulSoup, __version__

# Use a context manager so the file handle is closed after reading.
with open('data.txt', 'r') as fp:
    soup = BeautifulSoup(fp.read(), 'xml')

print('xbrli:* tags =', len(soup.select('xbrli|*')))
print('us-gaap:* tags =', len(soup.select('us-gaap|*')))

print('Version of bs4:', __version__)

Prints:

xbrli:* tags = 1268
us-gaap:* tags = 941
Version of bs4: 4.8.1
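
As a cross-check on counts like these that does not depend on the installed bs4 version, the elements can also be counted per namespace prefix with the standard library. This is a swapped-in technique (`xml.etree.ElementTree`, not BeautifulSoup), and it is only a minimal sketch: the `SAMPLE` document, its namespace URIs, and the helper name `count_by_prefix` are made up for illustration, and it assumes each namespace URI is bound to a single prefix.

```python
import xml.etree.ElementTree as ET
from collections import Counter
from io import StringIO

# Hypothetical stand-in for the real XBRL file; the URIs are illustrative only.
SAMPLE = """<xbrli:xbrl xmlns:xbrli="http://example.com/xbrli"
            xmlns:us-gaap="http://example.com/us-gaap">
  <xbrli:context id="c1"><xbrli:period/></xbrli:context>
  <us-gaap:Assets contextRef="c1">100</us-gaap:Assets>
</xbrli:xbrl>"""


def count_by_prefix(source):
    """Count elements per namespace prefix in a file name or file-like object."""
    uri_to_prefix = {}
    counts = Counter()
    # "start-ns" events carry (prefix, uri) and arrive before the "start"
    # event of the element that declares the namespace.
    for event, payload in ET.iterparse(source, events=("start-ns", "start")):
        if event == "start-ns":
            prefix, uri = payload
            uri_to_prefix[uri] = prefix
        else:
            tag = payload.tag  # looks like "{uri}localname" when namespaced
            uri = tag[1:].split("}", 1)[0] if tag.startswith("{") else ""
            counts[uri_to_prefix.get(uri, "")] += 1
    return counts


counts = count_by_prefix(StringIO(SAMPLE))
print('xbrli:* tags =', counts['xbrli'])
print('us-gaap:* tags =', counts['us-gaap'])
```

Passing a path such as 'test.xml' instead of the `StringIO` object should work the same way, since `ET.iterparse` accepts either a file name or a file object.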

edited Dec 24 '19 at 16:31

answered Dec 24 '19 at 16:02

Andrej Kesely


I also thought a CSS selector would be the best option. However, since I couldn't reproduce the problem after a few runs, I left it out. Nice thought, though. +1 – KunduK Dec 24 '19 at 16:24

@KunduK I've tested the OP's code on an old version of bs4, `4.4.1`, and indeed the count of xbrli tags is (sometimes) 0. So I presume the OP is using an old version of bs4. – Andrej Kesely Dec 24 '19 at 16:27
CSS namespace selector support wasn't added until 4.7.0 with the addition of soupsieve. – facelessuser Dec 24 '19 at 20:32

score 0 · Accepted Answer · answered Jan 06 '20 at 17:27


Using BeautifulSoup 4.8.1 solved the issue.
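
For anyone hitting the same symptom, a quick version check may help before debugging further. The sketch below is an illustration, not part of the original fix: the helper names `version_tuple` and `supports_css_namespaces` are hypothetical, and the 4.7.0 threshold is taken from the comment above about when CSS namespace selectors arrived with soupsieve. The thread does not establish which release fixed the `tag.prefix` behaviour, only that 4.4.1 showed it and 4.8.1 did not.

```python
import re

# Assumption from the comment thread: CSS namespace selectors need bs4 >= 4.7.0.
CSS_NAMESPACE_MIN = (4, 7, 0)


def version_tuple(version):
    """Turn a dotted version such as '4.8.1' into (4, 8, 1) for comparison."""
    parts = [int(number) for number in re.findall(r"\d+", version)[:3]]
    parts += [0] * (3 - len(parts))  # pad short versions such as '4.7'
    return tuple(parts)


def supports_css_namespaces(version):
    """Return True if this bs4 version should accept selectors like 'xbrli|*'."""
    return version_tuple(version) >= CSS_NAMESPACE_MIN


try:
    from bs4 import __version__ as bs4_version
except ImportError:
    bs4_version = None  # bs4 is not installed in this environment

if bs4_version is not None:
    print('Version of bs4:', bs4_version)
    print('CSS namespace selectors expected:', supports_css_namespaces(bs4_version))
```

If the check reports an old version, upgrading with `pip install --upgrade beautifulsoup4` is the usual route.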


user1315621
