
I am trying to scan a web page to find the link to a specific product using part of the product name.

The HTML below is the part I am trying to extract information from:

<article class='product' data-json-url='/en/GB/men/products/omia066s188000161001.json' id='product_24793' itemscope='' itemtype='http://schema.org/Product'>
<header>
<h3>OMIA066S188000161001</h3>
</header>
<a itemProp="url" href="/en/GB/men/products/omia066s188000161001"><span content='OFF WHITE Shoes OMIA066S188000161001' itemProp='name' style='display:none'></span>
<span content='OFF WHITE' itemProp='brand' style='display:none'></span>
<span content='OMIA066S188000161001' itemProp='model' style='display:none'></span>
<figure>
<img itemProp="image" alt="OMIA066S188000161001 image" class="top" src="https://cdn.off---white.com/images/156374/product_OMIA066S188000161001_1.jpg?1498806560" />
<figcaption>
<div class='brand-name'>
HIGH 3.0 SNEAKER
</div>
<div class='category-and-season'>
<span class='category'>Shoes</span>
</div>


<div class='price' itemProp='offers' itemscope='' itemtype='http://schema.org/Offer'>
<span content='530.0' itemProp='price'>
<strong>£ 530</strong>
</span>
<span content='GBP' itemProp='priceCurrency'></span>
</div>


<div class='size-box js-size-box'>
<!-- / .available-size -->
<!-- /   = render 'availability', product: product -->
<div class='sizes'></div>
</div>
</figcaption>
</figure>
</a></article>

My code is below:

import requests
from bs4 import BeautifulSoup

item_to_find = 'off white shoes'

s = requests.Session()
r = s.get('https://www.off---white.com/en/GB/section/new-arrivals.js')
soup = BeautifulSoup(r.content, 'html.parser')
#find_url = soup.find("a", {"content":item_to_find})['href']
#print(find_url)

How do I filter only the line where 'content' contains item_to_find and then extract the 'href' for that product?

The final output should look like the below:

/en/GB/men/products/omia066s188000161001
Piers Thomas

2 Answers


Give this a shot.

import requests
from bs4 import BeautifulSoup

item_to_find = 'off white shoes'

s = requests.Session()
r = s.get('https://www.off---white.com/en/GB/section/new-arrivals.js')
soup = BeautifulSoup(r.content, 'html.parser')
links = soup.find_all("a")

for link in links:
    if 'OFF WHITE Shoes' in link.encode_contents():
        print link.get('href')

Since the "OFF WHITE Shoes" text sits in the content attribute of a span inside the link, we can use encode_contents() to search all of the markup within each link. If the text we are looking for is present, we get the URL with BeautifulSoup's .get method.
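Under Python 3, encode_contents() returns bytes, so testing a str against it raises a TypeError. A minimal sketch of the same idea for Python 3 follows, assuming decode_contents() (which returns the inner markup as str) is an acceptable substitute. The html string is a trimmed copy of the markup from the question, used here in place of a live request, and the lowercase comparison is an added assumption so that the asker's 'off white shoes' search term matches.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the markup from the question (stands in for r.content).
html = '''
<article class='product'>
<a itemProp="url" href="/en/GB/men/products/omia066s188000161001"><span content='OFF WHITE Shoes OMIA066S188000161001' itemProp='name' style='display:none'></span>
</a></article>
'''

item_to_find = 'off white shoes'

soup = BeautifulSoup(html, 'html.parser')
matches = []
for link in soup.find_all('a'):
    # decode_contents() gives the link's inner markup as str, not bytes,
    # so it can be compared against a str search term.
    if item_to_find in link.decode_contents().lower():
        matches.append(link.get('href'))

print(matches)
```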

Trevor
  • Thank you for looking at this. When I run the code, the line `print link.get('href')` fails with `SyntaxError: invalid syntax` – Piers Thomas Apr 27 '18 at 18:17
  • @PiersThomas what version of Python are you using? Try this: `print(link.get('href'))` – Trevor Apr 27 '18 at 18:20
  • Python V3.6.3 is my current version – Piers Thomas Apr 27 '18 at 18:21
  • File "t.py", line 36, in if 'OFF WHITE Shoes' in link.encode_contents(): TypeError: a bytes-like object is required, not 'str' – Piers Thomas Apr 27 '18 at 18:21
  • My bad, I'm using Python version 2.7.10. Logic should still be the same for Python 3 just different syntax. – Trevor Apr 27 '18 at 18:25
  • @PiersThomas Give this a shot: https://stackoverflow.com/questions/33054527/python-3-5-typeerror-a-bytes-like-object-is-required-not-str-when-writing-t – Trevor Apr 27 '18 at 18:32

A more specific answer for Python 3 would be:

import requests
from urllib.parse import urljoin
from bs4 import BeautifulSoup

search_item = 'orange timberland'  # must be lowercase; a partial product name is enough
URL = 'https://www.off---white.com/en/GB/section/new-arrivals.js'

res = requests.get(URL)
soup = BeautifulSoup(res.text, 'html.parser')
for link in soup.find_all(class_="brand-name"):
    if search_item in link.text.lower():
        item_name = link.get_text(strip=True)
        item_link = urljoin(URL, link.find_parent('a').get('href'))  # enclosing <a> holds the product URL
        print("Name: {}\nLink: {}".format(item_name,item_link))

Output:

Name: ORANGE TIMBERLAND BOOTS
Link: https://www.off---white.com/en/GB/men/products/omia073s184780161900
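
The question asked specifically about filtering on the content attribute. A hedged sketch of that variant follows: it looks at the span carrying itemProp='name' and walks up to the enclosing link. The helper name find_product_links and the sample_html string (a trimmed copy of the question's markup) are illustrative assumptions, not part of the original answer. Note that html.parser lowercases attribute names, so the lookup uses 'itemprop'.

```python
from urllib.parse import urljoin
from bs4 import BeautifulSoup

URL = 'https://www.off---white.com/en/GB/section/new-arrivals.js'

# Trimmed copy of the markup from the question (stands in for res.text).
sample_html = '''
<article class='product'>
<a itemProp="url" href="/en/GB/men/products/omia066s188000161001"><span content='OFF WHITE Shoes OMIA066S188000161001' itemProp='name' style='display:none'></span>
<span content='OFF WHITE' itemProp='brand' style='display:none'></span>
</a></article>
'''

def find_product_links(markup, search_item):
    """Hypothetical helper: return hrefs whose name span's content contains search_item."""
    soup = BeautifulSoup(markup, 'html.parser')
    links = []
    for span in soup.find_all('span', attrs={'itemprop': 'name'}):
        # Case-insensitive substring match on the span's content attribute.
        if search_item.lower() in span.get('content', '').lower():
            anchor = span.find_parent('a')
            if anchor is not None and anchor.get('href'):
                links.append(anchor['href'])
    return links

for href in find_product_links(sample_html, 'off white shoes'):
    print(href)                 # relative path, as requested in the question
    print(urljoin(URL, href))   # absolute URL
```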
MITHU