
The backstory is a little complicated, but necessary, so please bear with me.

I'm trying to parse an SEC Edgar filing (this Form 10-K, as a random example), not for its financial data but for the list of Exhibits contained in a table toward the end of the document. Each exhibit listed in that table has three attributes I'm interested in (exhibit number, title and URL), but for this example I'll focus only on the URL.

Finding all the URLs in the document is easy enough to begin with:

from lxml import etree
import lxml.html

# Parse the locally saved filing first (the file name here is just a placeholder)
tree = lxml.html.parse('form10k.htm')

for element in tree.iter('a'):
    target = element.get('href')  # the link's URL, or None if there is no href

But since the document may contain hundreds of URLs, most of which are irrelevant, I have to filter the results for the presence of the word Archives, which appears without exception in all Edgar URLs. So in the next stage, I get the xpath of each of them:

    # still inside the loop over the <a> elements
    if target is not None and 'Archives' in target:
        print(tree.getpath(element))

So far so good, but this is where I get stuck: it turns out that, for some really bizarre reason, each of the relevant URLs appears not in one but in two (and in some documents up to four!) tables, and these tables are not, unfortunately, the first or last tables in the document but are stuck randomly somewhere in the middle. So, for example, Exhibit 10-5's xpaths are:

/html/body/document/type/sequence/filename/text/div[2]/table[9]/tr[17]/td[3]/p/a

/html/body/document/type/sequence/filename/text/div[2]/table[12]/tr[17]/td[3]/p/a

So the URL appears in exactly the same location in both table 9 and table 12. Obviously, I don't want this URL to appear twice in my final URL list, so in my final search I would like to run

for i in tree.xpath('//table[XXX]//*/a'):
    print(i.get('href'))

where XXX is either 9 or 12 in this example.

So back to the title of the question - how do I extract the index number of the table so I can select the higher (or lower) index number for my tree.xpath() expression? Alternatively, is there a way to stop the getpath search at table 9?
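To illustrate what I mean by extracting the index number, this is roughly the workaround I have in mind (just a sketch; the regex over the getpath() string is only one way I can think of to pull the index out, and the keep-the-lowest-index step is exactly the inelegant part I'd like to avoid):

import re

# Sketch only: for every Archives link, grab the URL and the index of the table
# it sits in, keeping one entry per URL (the lowest table index wins).
table_index_by_url = {}
for element in tree.iter('a'):
    target = element.get('href')
    if target is not None and 'Archives' in target:
        match = re.search(r'/table\[(\d+)\]', tree.getpath(element))
        if match:
            index = int(match.group(1))  # e.g. 9 or 12 for Exhibit 10-5 above
            previous = table_index_by_url.get(target)
            table_index_by_url[target] = index if previous is None else min(previous, index)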

Jack Fleeting
  • So, do you want to select distinct `href` attributes of `a` elements matching some criterion (containing the text `'Archives'`)? Python's `lxml` module implements XPath 1.0 only, so grouping in pure XPath means the well-known quadratic XPath 1.0 grouping expression. If performance is important I would go with a SAX parser and a set, managing the string keys carefully. Otherwise, I would get all the `href` values with XPath and deduplicate in Python. – Alejandro Apr 11 '19 at 13:11
  • @Alejandro Yes, deduplicating in python is doable (and I've done it), but it's just ... inelegant; I'll definitely take a look at a SAX parser. Thanks for the tip! – Jack Fleeting Apr 11 '19 at 13:15
  • For the simple case I was thinking about something like `set(doc.xpath("//a[contains(., 'Archives')]/@href"))` – Alejandro Apr 11 '19 at 14:34
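A self-contained version of that one-liner might look like the following (sketch only; the file name is a placeholder, and @href is used in the contains() test because, per the question, 'Archives' appears in the URL rather than necessarily in the link text):

import lxml.html

# Placeholder file name: any locally saved Edgar filing
tree = lxml.html.parse('form10k.htm')

# Collect the distinct Archives URLs in one pass; set() takes care of the duplicates
archive_urls = set(tree.xpath("//a[contains(@href, 'Archives')]/@href"))
for url in sorted(archive_urls):
    print(url)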

0 Answers