
Here is a way to get unique values. It doesn't work when I want to get a unique attribute. For example:

<a href = '11111'>sometext</a>
<a href = '11121'>sometext2</a>
<a href = '11111'>sometext3</a>

I want to get the unique hrefs. I'm restricted to XPath 1.0.

page_src.xpath('//a[not(.=preceding::a)]')
page_src.xpath('//a/@href[not(.=preceding::a/@href)]')

Both return duplicates. Is it possible to work around the absence of unique-values()?
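
For reference, a minimal reproduction of the setup (a sketch; page_src is assumed to be built with lxml.html.fromstring):

from lxml import html

# Sample markup from above; page_src is assumed to be an lxml tree.
page_src = html.fromstring("""
<div>
  <a href='11111'>sometext</a>
  <a href='11121'>sometext2</a>
  <a href='11111'>sometext3</a>
</div>
""")

# The two attempts described above.
print(page_src.xpath('//a[not(.=preceding::a)]'))
print(page_src.xpath('//a/@href[not(.=preceding::a/@href)]'))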

UPD: it's not the function-like solution I wanted, but I wrote a Python function that iterates over parent elements and checks whether adding a parent-tag prefix filters the links down to the needed count.

Here is my example:

_x_item = (
    '//a[starts-with(@href, "%s")'
    ' and not(@href="%s")'
    ' and not(starts-with(@href, "%s"))]'
    % (param1, param1, param2))

# Remove duplicate links: if the expression matches more links than there are
# unique hrefs, try narrowing it with the tag of a duplicate's parent element.
neededLinks = [a.get('href') for a in page_src.xpath(_x_item)]
uniqLength = len(set(neededLinks))
if len(neededLinks) != uniqLength:
    breakFlag = False
    for linkk in neededLinks:
        if neededLinks.count(linkk) > 1:
            dupLinks = page_src.xpath('//a[@href="%s"]' % linkk)
            dupLinkParents = [a.getparent() for a in dupLinks]
            for dupParent in dupLinkParents:
                # Prefix the expression with the parent's tag and check whether
                # that brings the match count down to the unique count.
                narrowed = _x_item.replace('//', '//%s/' % dupParent.tag)
                tempLinks = [a.get('href') for a in page_src.xpath(narrowed)]
                if len(tempLinks) == uniqLength:
                    _x_item = narrowed
                    breakFlag = True
                    break
            if breakFlag:
                break

This WILL work if the duplicate links have different parents but the same @href value.

As a result I add the parent's tag as a prefix, like //div/my_prev_x_item.

Plus, using Python, I can refine the result to //div[@key1="val1" and @key2="val2"]/my_prev_x_item by iterating over dupParent.items(), as sketched below. But this only works if the duplicates are not located in the same parent element.
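
For illustration, this is roughly how such a predicate could be built from dupParent.items() (a sketch; parent_predicate is a made-up helper name):

def parent_predicate(dup_parent):
    # Build e.g. //div[@key1="val1" and @key2="val2"]/ from the parent's tag and attributes.
    conditions = ' and '.join('@%s="%s"' % (key, value) for key, value in dup_parent.items())
    if conditions:
        return '//%s[%s]/' % (dup_parent.tag, conditions)
    return '//%s/' % dup_parent.tag

# Replace only the leading // so the rest of the expression stays untouched.
_x_item = _x_item.replace('//', parent_predicate(dupParent), 1)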

In the end I need only the XPath expression itself, so I can't just use list(set(myItems)).

I want an easier solution (like unique-values()), if one exists. Also, my solution does not work when the duplicate links share the same parent.

Vova
  • What version of lxml are you using? Your second xpath works fine for me in version 4.2.1. You _could_ try `//a[not(@href=preceding::a/@href)]/@href` instead, but like I said `//a/@href[not(.=preceding::a/@href)]` works fine for me. – Daniel Haley Oct 19 '18 at 16:25
  • 1.0 (python lxml.xpath function) – Vova Oct 22 '18 at 08:52
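
The expression suggested in the comment can be tried directly (a sketch, assuming the same page_src as above):

# Filter on @href via the preceding axis, as suggested in the comments.
unique_hrefs = page_src.xpath('//a[not(@href=preceding::a/@href)]/@href')
print(unique_hrefs)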

1 Answer


You can extract all the hrefs and then find the unique ones:

all_hrefs = page_src.xpath('//a/@href')
unique_hrefs = list(set(all_hrefs))
Vikas Ojha
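
If document order matters, a dict-based variant keeps the first occurrence of each href (a sketch building on the snippet above; relies on dicts preserving insertion order in Python 3.7+):

all_hrefs = page_src.xpath('//a/@href')
# dict.fromkeys keeps the first occurrence of each href in document order.
unique_hrefs = list(dict.fromkeys(all_hrefs))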