
I have been trying to parse a webpage using BeautifulSoup. When I import urlopen from urllib.request and open https://pbejobbers.com, it returns the following instead of the webpage itself:

<html>
  <body>
    <script src="/aes.min.js" type="text/javascript"></script>
    <script>
      function toNumbers(d){var e=[];d.replace(/(..)/g,function(d){e.push(parseInt(d,16))});return e}
      function toHex(){for(var d=[],d=1==arguments.length&&arguments[0].constructor==Array?arguments[0]:arguments,e="",f=0;f<d.length;f++)e+=(16>d[f]?"0":"")+d[f].toString(16);return e.toLowerCase()}
      var a=toNumbers("0181cdf0013bf70f89e91be7ef0d00c2"),b=toNumbers("a168ceeade18bccc1cdd77af68ef1753"),c=toNumbers("200a38f39b6a3fe3564acf9bd88c25da");
      document.cookie="OCXS="+toHex(slowAES.decrypt(c,2,a,b))+"; expires=Thu, 31-Dec-37 23:55:55 GMT; path=/";
      document.location.href="http://pbejobbers.com/product/search?search=USC4215&81e93addddb02a10cd0652f09370ae96=1";
    </script>
  </body>
</html>

I have an array of UPC codes that I use to find the products I am looking for. I pass the array to a function and parse the HTML to find the necessary tags, but I can't get to the actual HTML. Here is my code:

from urllib.request import urlopen
from bs4 import BeautifulSoup

upc_codes = ['USC4215', 'USC4225', 'USC12050']

def retrunh1(upc):
    html = urlopen('https://pbejobbers.com/product/search?search={}'.format(upc))
    soup = BeautifulSoup(html, 'html.parser')
    print(soup.prettify())

if __name__=='__main__':
    for upc in upc_codes:
        retrunh1(upc)

I think the problem is with the request function. I isolated it to see what it returns, and I get the same HTML back as above when I do this:

import requests

r = requests.get('https://pbejobbers.com')

print(r.text)

I am quite new to web scraping and I need some suggestions on how to resolve this. Thanks

Alex
    They don't want you scraping them. So don't! Contact them for access to their data. – xrisk Jan 14 '20 at 04:51
  • You might need to set the `User-Agent` header, perhaps? – Ken Y-N Jan 14 '20 at 05:01
  • I don't need their data. I am learning web scraping through Team Treehouse and came across this site. I only picked it because it lets you search without creating an account. I am just trying to test my skills. – Alex Jan 14 '20 at 14:19

3 Answers

The JavaScript probably populates the HTML portion of the page dynamically once the browser starts executing it, so urllib can't retrieve the complete source.

Your python script needs to use a headless browser framework like Selenium to load the page as a browser would and then extract what you need.

As others mentioned, please do not violate their terms of service, especially if the data is private or behind a login page.
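A minimal sketch of that approach, assuming Chrome and a matching chromedriver are available on PATH (the fixed sleep is a crude stand-in for a proper wait condition):

from time import sleep
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
options.add_argument('--headless')          # run without opening a browser window
driver = webdriver.Chrome(options=options)  # assumes chromedriver is on PATH

try:
    driver.get('https://pbejobbers.com/product/search?search=USC4215')
    sleep(5)  # crude wait for the cookie-setting script and the redirect to finish
    soup = BeautifulSoup(driver.page_source, 'html.parser')
    print(soup.prettify())
finally:
    driver.quit()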

Emrah Diril
  • I have heard of Selenium. Maybe I need to set a timeout and make the script wait a couple of seconds for the page to reload and then try to get the html. Thank you for your suggestion – Alex Jan 14 '20 at 14:24

When I manually search for USC4215, the URL is https://pbejobbers.com/product/search?search=USC4215&_rand=0.35863039778309025

The website appends a random secret _rand to prevent robot crawling. You need to make a request with a valid random secret to receive a response.

In fact, the secret is usually generated along with a set of cookies. If you open Inspect ==> Network ==> Doc and press Ctrl + R to refresh the site, you can see the network traffic generated as you make another request, including exactly what your HTTP request and response contents are.
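As a starting point for that kind of inspection, here is a sketch using requests.Session; the User-Agent and the client-generated _rand value are assumptions, and since the real cookies normally come from the JavaScript challenge, the site may still refuse the request:

import random
import requests

# One session, so any cookies the server sets are replayed on later requests.
session = requests.Session()
session.headers.update({'User-Agent': 'Mozilla/5.0'})  # some sites reject the default python-requests agent

# Mimic the browser URL, including a client-generated _rand value.
r = session.get('https://pbejobbers.com/product/search',
                params={'search': 'USC4215', '_rand': random.random()})

print(r.status_code)
print(session.cookies.get_dict())  # compare these with the cookies shown in the Network tab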

chrisckwong821
  • I noticed _rand as well, but I had no idea what it was. I see that the website sets many cookies, but I am not sure how _rand is generated. Can you give me general advice on what to look into? Maybe documentation? – Alex Jan 14 '20 at 14:29

Please try this.

Python code:

from bs4 import BeautifulSoup
import requests
import re

upc_codes = ['USC4215', 'USC4225', 'USC12050']

def retrunh1(upc):
    payload = {'search': upc}
    r = requests.get('https://pbejobbers.com/product', params=payload)

    # The response is the JavaScript challenge page; pull the redirect target
    # out of its document.location.href assignment.
    matches = re.search(r'document\.location\.href=\"(:?.*)=1\";', str(r.text), re.M | re.S)
    url = matches[1]

    response = requests.get(url)

    # Walk the redirect chain and parse the page each hop points to.
    for resp in response.history:
        r = requests.post(resp.headers['Location'])
        soup = BeautifulSoup(r.content, 'html.parser')
        print(soup.prettify())

if __name__=='__main__':
    for upc in upc_codes:
        retrunh1(upc)

Output:

<div class="page-area-container">
    <div class=" middlebar">
        <div class=" middlebar__left">
            <a class=" logo" href="/">
                <img alt="PBE Jobbers" class=" logo-img" src="/bundles/pjfrontend/pbejobbers/images/logo/pbe-logo.svg?version=9d4c5d60"/>
            </a>
        </div>
        ...
    </div>
    ...
</div>
Neda Peyrone
  • Wow, your script works! But I want to understand how you did it. I see that you are using regular expressions. Can you please explain the parameters of your regular expression? How come you got the right page? – Alex Jan 14 '20 at 14:41
  • The page redirects via JavaScript on page load. To get the new URL, I extract it from the script tag with the regular expression and then call that new URL instead of the old one. But the new URL still doesn't return the data directly, because it is redirected again through the Location response header. So the code makes a POST request to each Location header to retrieve the product data. – Neda Peyrone Jan 14 '20 at 15:38
  • While going through the html, I noticed that it is not returning the product page. Instead it is returning the website's 404 page. I tried it with different upc codes and it is the same result. What might be the issue? – Alex Jan 14 '20 at 22:07
  • I just realized that I had to add 'search?' to the request url. Now it works! – Alex Jan 14 '20 at 22:16
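For reference, a minimal sketch of the change described in the last comment, pointing the initial request at the /product/search endpoint rather than /product (inferred from the comment, not verified against the site):

import requests

upc = 'USC4215'  # example UPC from the question
payload = {'search': upc}
# Send the initial request to the search endpoint instead of /product:
r = requests.get('https://pbejobbers.com/product/search', params=payload)
print(r.status_code)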