
Here is a way to get unique values. It doesn't work when I want to get a unique attribute. For example:

<a href = '11111'>sometext</a>
<a href = '11121'>sometext2</a>
<a href = '11111'>sometext3</a>

I want to get the unique hrefs. I'm restricted to XPath 1.0.

page_src.xpath('//a[not(.=preceding::a)]')
page_src.xpath('//a/@href[not(.=preceding::a/@href)]')

Both return duplicates. Is it possible to work around the absence of unique-values()?
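
For reference, a minimal reproduction of the setup (a sketch; page_src is assumed to be built with lxml.html.fromstring):

from lxml import html

# Sample markup from above; page_src is assumed to be an lxml tree.
page_src = html.fromstring("""
<div>
  <a href='11111'>sometext</a>
  <a href='11121'>sometext2</a>
  <a href='11111'>sometext3</a>
</div>
""")

# The two attempts described above.
print(page_src.xpath('//a[not(.=preceding::a)]'))
print(page_src.xpath('//a/@href[not(.=preceding::a/@href)]'))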

UPD: it's not the function-like solution I wanted, but I wrote a Python function that iterates over parent elements and checks whether adding a parent-tag prefix filters the links down to the needed count.

Here is my example:

_x_item = (
    '//a[starts-with(@href, "%s")'
    ' and not(@href="%s")'
    ' and not(starts-with(@href, "%s"))]'
    % (param1, param1, param2))

# Remove duplicate links: if the expression matches more links than there are
# unique hrefs, try narrowing it with the tag of a duplicate's parent element.
neededLinks = [a.get('href') for a in page_src.xpath(_x_item)]
uniqLength = len(set(neededLinks))
if len(neededLinks) != uniqLength:
    breakFlag = False
    for linkk in neededLinks:
        if neededLinks.count(linkk) > 1:
            dupLinks = page_src.xpath('//a[@href="%s"]' % linkk)
            dupLinkParents = [a.getparent() for a in dupLinks]
            for dupParent in dupLinkParents:
                # Prefix the expression with the parent's tag and check whether
                # that brings the match count down to the unique count.
                narrowed = _x_item.replace('//', '//%s/' % dupParent.tag)
                tempLinks = [a.get('href') for a in page_src.xpath(narrowed)]
                if len(tempLinks) == uniqLength:
                    _x_item = narrowed
                    breakFlag = True
                    break
            if breakFlag:
                break

This WILL work if the duplicate links have different parents but the same @href value.

As a result I add the parent's tag as a prefix, like //div/my_prev_x_item.

Plus, using Python, I can refine the result to //div[@key1="val1" and @key2="val2"]/my_prev_x_item by iterating over dupParent.items(), as sketched below. But this only works if the duplicates are not located in the same parent element.
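
For illustration, this is roughly how such a predicate could be built from dupParent.items() (a sketch; parent_predicate is a made-up helper name):

def parent_predicate(dup_parent):
    # Build e.g. //div[@key1="val1" and @key2="val2"]/ from the parent's tag and attributes.
    conditions = ' and '.join('@%s="%s"' % (key, value) for key, value in dup_parent.items())
    if conditions:
        return '//%s[%s]/' % (dup_parent.tag, conditions)
    return '//%s/' % dup_parent.tag

# Replace only the leading // so the rest of the expression stays untouched.
_x_item = _x_item.replace('//', parent_predicate(dupParent), 1)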

In the end I need only the XPath expression itself, so I can't just use list(set(myItems)).

I want an easier solution (like unique-values()), if one exists. Also, my solution does not work when the duplicate links share the same parent.

Vova
  • What version of lxml are you using? Your second xpath works fine for me in version 4.2.1. You _could_ try `//a[not(@href=preceding::a/@href)]/@href` instead, but like I said `//a/@href[not(.=preceding::a/@href)]` works fine for me. – Daniel Haley Oct 19 '18 at 16:25
  • 1.0 (python lxml.xpath function) – Vova Oct 22 '18 at 08:52
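
The expression suggested in the comment can be tried directly (a sketch, assuming the same page_src as above):

# Filter on @href via the preceding axis, as suggested in the comments.
unique_hrefs = page_src.xpath('//a[not(@href=preceding::a/@href)]/@href')
print(unique_hrefs)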

1 Answer


You can extract all the hrefs and then find the unique ones:

all_hrefs = page_src.xpath('//a/@href')
unique_hrefs = list(set(all_hrefs))
Vikas Ojha
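
If document order matters, a dict-based variant keeps the first occurrence of each href (a sketch building on the snippet above; relies on dicts preserving insertion order in Python 3.7+):

all_hrefs = page_src.xpath('//a/@href')
# dict.fromkeys keeps the first occurrence of each href in document order.
unique_hrefs = list(dict.fromkeys(all_hrefs))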