Python - Extract href value based on content value

Question

I am trying to scan a web page to find the link to a specific product using part of the product name.

The HTML below is the part I am trying to extract information from:

<article class='product' data-json-url='/en/GB/men/products/omia066s188000161001.json' id='product_24793' itemscope='' itemtype='http://schema.org/Product'>
<header>
<h3>OMIA066S188000161001</h3>
</header>
<a itemProp="url" href="/en/GB/men/products/omia066s188000161001"><span content='OFF WHITE Shoes OMIA066S188000161001' itemProp='name' style='display:none'></span>
<span content='OFF WHITE' itemProp='brand' style='display:none'></span>
<span content='OMIA066S188000161001' itemProp='model' style='display:none'></span>
<figure>
<img itemProp="image" alt="OMIA066S188000161001 image" class="top" src="https://cdn.off---white.com/images/156374/product_OMIA066S188000161001_1.jpg?1498806560" />
<figcaption>
<div class='brand-name'>
HIGH 3.0 SNEAKER
</div>
<div class='category-and-season'>
<span class='category'>Shoes</span>
</div>


<div class='price' itemProp='offers' itemscope='' itemtype='http://schema.org/Offer'>
<span content='530.0' itemProp='price'>
<strong>£ 530</strong>
</span>
<span content='GBP' itemProp='priceCurrency'></span>
</div>


<div class='size-box js-size-box'>
<!-- / .available-size -->
<!-- /   = render 'availability', product: product -->
<div class='sizes'></div>
</div>
</figcaption>
</figure>
</a></article>

My code is below:

import requests
from bs4 import BeautifulSoup

item_to_find = 'off white shoes'

s = requests.Session()
r = s.get('https://www.off---white.com/en/GB/section/new-arrivals.js')
soup = BeautifulSoup(r.content, 'html.parser')
#find_url = soup.find("a", {"content":item_to_find})['href']
#print(find_url)

How do I filter only the line where 'content' contains item_to_find and then extract the 'href' for that product?

The final output should look like the below:

/en/GB/men/products/omia066s188000161001

score 2 · Accepted Answer · answered Apr 27 '18 at 18:12

Give this a shot.

import requests
from bs4 import BeautifulSoup

item_to_find = 'off white shoes'

s = requests.Session()
r = s.get('https://www.off---white.com/en/GB/section/new-arrivals.js')
soup = BeautifulSoup(r.content, 'html.parser')
links = soup.find_all("a")

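for link in links:
    # decode_contents() returns the markup inside the link as str, nested spans included
    if item_to_find.lower() in link.decode_contents().lower():
        print(link.get('href'))

Since the "OFF WHITE Shoes" text sits in the content attribute of a span inside the link, we can use decode_contents() to search all of the markup within each link. Unlike encode_contents(), it returns str rather than bytes, so the check also works on Python 3, and lowercasing both sides makes it case-insensitive. If the text we are searching for is present, we read the link with BeautifulSoup's .get method.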

Trevor


Thank you for looking at this - when I run the code I get the following. print link.get('href') ^ SyntaxError: invalid syntax – Piers Thomas Apr 27 '18 at 18:17
@PiersThomas what version of Python are you using? Try this: `print(link.get('href'))` – Trevor Apr 27 '18 at 18:20
Python V3.6.3 is my current version – Piers Thomas Apr 27 '18 at 18:21
File "t.py", line 36, in if 'OFF WHITE Shoes' in link.encode_contents(): TypeError: a bytes-like object is required, not 'str' – Piers Thomas Apr 27 '18 at 18:21
My bad, I'm using Python version 2.7.10. Logic should still be the same for Python 3 just different syntax. – Trevor Apr 27 '18 at 18:25
@PiersThomas Give this a shot: https://stackoverflow.com/questions/33054527/python-3-5-typeerror-a-bytes-like-object-is-required-not-str-when-writing-t – Trevor Apr 27 '18 at 18:32
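The TypeError reported in the comments comes from searching for a str inside bytes: on Python 3, encode_contents() returns bytes, while decode_contents() returns str. Below is a minimal sketch of the difference, assuming a one-line stand-in for the product markup (the href /p/1 is made up for illustration).

```python
from bs4 import BeautifulSoup

# Hypothetical one-line stand-in for a product link
html = '<a href="/p/1"><span content="OFF WHITE Shoes"></span></a>'
link = BeautifulSoup(html, 'html.parser').find('a')

raw = link.encode_contents()    # bytes on Python 3
text = link.decode_contents()   # str on Python 3

# Searching for a str inside bytes raises TypeError on Python 3
try:
    'OFF WHITE Shoes' in raw
    mixed_types_ok = True
except TypeError:
    mixed_types_ok = False

# Any of these keeps both sides the same type
bytes_match = b'OFF WHITE Shoes' in raw
str_match = 'OFF WHITE Shoes' in text
decoded_match = 'OFF WHITE Shoes' in raw.decode('utf-8')

print(mixed_types_ok, bytes_match, str_match, decoded_match)
```

Using decode_contents() is probably the least intrusive fix, since the search term can stay an ordinary string.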

score 0 · Answer 2 · answered Apr 27 '18 at 20:34

A more specific answer for Python 3 would be:

import requests
from urllib.parse import urljoin
from bs4 import BeautifulSoup

search_item = 'orange timberland'  # must be lowercase; a partial product name is enough
URL = 'https://www.off---white.com/en/GB/section/new-arrivals.js'

res = requests.get(URL)
soup = BeautifulSoup(res.text, 'html.parser')
for link in soup.find_all(class_="brand-name"):
    # compare in lowercase so the match is case-insensitive
    if search_item in link.text.lower():
        item_name = link.get_text(strip=True)
        # the enclosing <a> is three levels up: figcaption, figure, a
        item_link = urljoin(URL,link.find_parents()[2].get('href'))
        print("Name: {}\nLink: {}".format(item_name,item_link))

Output:

Name: ORANGE TIMBERLAND BOOTS
Link: https://www.off---white.com/en/GB/men/products/omia073s184780161900
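For comparison, here is a sketch that filters on the content attribute directly, as the question asks, and then walks up to the enclosing link with find_parent("a") instead of counting parent levels. It runs on a trimmed copy of the markup from the question rather than a live request. The second product and the function name find_product_hrefs are invented for illustration.

```python
import re
from bs4 import BeautifulSoup

# Trimmed copy of the markup from the question; the second product is invented for contrast
SAMPLE_HTML = """
<article class='product'>
<a itemProp="url" href="/en/GB/men/products/omia066s188000161001">
<span content='OFF WHITE Shoes OMIA066S188000161001' itemProp='name' style='display:none'></span>
<span content='OFF WHITE' itemProp='brand' style='display:none'></span>
</a>
</article>
<article class='product'>
<a itemProp="url" href="/en/GB/men/products/hypothetical-bag">
<span content='OFF WHITE Bags HYPOTHETICAL' itemProp='name' style='display:none'></span>
</a>
</article>
"""

def find_product_hrefs(html, item_to_find):
    """Return hrefs of links whose name span's content attribute contains item_to_find."""
    soup = BeautifulSoup(html, 'html.parser')
    # Case-insensitive substring match; re.escape keeps the search term literal
    pattern = re.compile(re.escape(item_to_find), re.IGNORECASE)
    hrefs = []
    # html.parser lowercases attribute names, so itemProp is matched as itemprop
    for span in soup.find_all('span', attrs={'itemprop': 'name', 'content': pattern}):
        anchor = span.find_parent('a')  # nearest enclosing <a>, at whatever depth
        if anchor is not None and anchor.get('href'):
            hrefs.append(anchor['href'])
    return hrefs

print(find_product_hrefs(SAMPLE_HTML, 'off white shoes'))
```

The hrefs come back relative, in the form the question asks for. Passing each one through urljoin with the page URL, as in the answer above, would turn it into an absolute link.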
