-1

Before I start let me point out that I have almost no clue wtf I'm doing. Like imagine a cat that tries to do some coding. I try to write some Python code using Pycharm on Ubuntu 22.04.1 LTS and also used Insomnia if this makes any difference. Here is the code:

`

# sad_scrape_code_attempt.py

import time
import httpx
from playwright.sync_api import sync_playwright

HEADERS = {
    "User-Agent": "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:106.0) Gecko/20100101 Firefox/106.0",
    "Accept": "*/*",
    "Accept-Language": "en-US,en;q=0.5",
    "Accept-Encoding": "gzip, deflate, br",
    "Referer": "https://shop.metro.bg/shop/cart",
    "CallTreeId": "||BTOC-1BF47A0C-CCDD-47BB-A9DA-592009B5FB38",
    "Content-Type": "application/json; charset=UTF-8",
    "x-timeout-ms": "5000",
    "DNT": "1",
    "Connection": "keep-alive",

    "Sec-Fetch-Dest": "empty",
    "Sec-Fetch-Mode": "cors",
    "Sec-Fetch-Site": "same-origin"
}

def get_cookie_playwright():
    with sync_playwright() as p:
        browser = p.firefox.launch(headless=False, slow_mo=50)
        context = browser.new_context()
        page = context.new_page()
        page.goto('https://shop.metro.bg/shop/cart')
        page.fill('input#user_id', 'the_sad_cat_username')
        page.fill('input#password', 'the_sad_cat_password')
        page.click('button[type=submit]')
        page.click('button.btn-primary.accept-btn.field-accept-button-name')
        page.evaluate(
            """
            var intervalID = setInterval(function () {
                var scrollingElement = (document.scrollingElement || document.body);
                scrollingElement.scrollTop = scrollingElement.scrollHeight;
            }, 200);

            """
        )
        prev_height = None
        while True:
            curr_height = page.evaluate('(window.innerHeight + window.scrollY)')
            if not prev_height:
                prev_height = curr_height
                time.sleep(1)
            elif prev_height == curr_height:
                page.evaluate('clearInterval(intervalID)')
                break
            else:
                prev_height = curr_height
                time.sleep(1)

# print(context.cookies())
        cookie_for_requests = context.cookies()[11]['value']
        browser.close()
    return cookie_for_requests

def req_with_cookie(cookie_for_requests):
    cookies = dict(
        Cookie=f'BIGipServerbetty.metrosystems.net-80={cookie_for_requests};')
    r = httpx.get('https://shop.metro.bg/ordercapture.customercart.v1/carts/alias/current', cookies=cookies)
    return r.text

if __name__ == '__main__':
    data = req_with_cookie(get_cookie_playwright())
    print(data)

# Used packages

#Playwright
#PyTest
#PyTest-Playwirght
#JavaScript
#TypeScript
#httpx

`

so basically I copy paste the code of 2 tutorials made by John Watson Rooney called:

  1. The Biggest Mistake Beginners Make When Web Scraping
  2. Login and Scrape Data with Playwright and Python

Than combined them and added some JavaScript to scroll to the bottom of the page. Than I found an article called: How Headers Are Used to Block Web Scrapers and How to Fix It

thus replacing "import requests" with "import httpx" and added the HEADERS as per given from Insomnia. From what I understand browsers return headers in certain order and this is an often overlooked web scraper identification method. Primarily because many http clients in various programming languages implement their own header ordering - making identification of web scrapers very easy! If this is true I need to figure out a way to return my cookies header following the correct order, which by the way I have no clue how to figure out but I believe its #11 or #3 judging by the code generated by Insomnia:

`

import requests

url = "https://shop.metro.bg/ordercapture.customercart.v1/carts/alias/current"

querystring = {"customerId":"1001100022726355","cardholderNumber":"1","storeId":"00022","country":"BG","locale":"bg-BG","fsdAddressId":"1001100022726355994-AD0532EI","__t":"1668082324830"}

headers = {
    "User-Agent": "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:106.0) Gecko/20100101 Firefox/106.0",
    "Accept": "*/*",
    "Accept-Language": "en-US,en;q=0.5",
    "Accept-Encoding": "gzip, deflate, br",
    "Referer": "https://shop.metro.bg/shop/cart",
    "CallTreeId": "||BTOC-1BF47A0C-CCDD-47BB-A9DA-592009B5FB38",
    "Content-Type": "application/json; charset=UTF-8",
    "x-timeout-ms": "5000",
    "DNT": "1",
    "Connection": "keep-alive",
    "Cookie": "selectedLocale_BG=bg-BG; BIGipServerbetty.metrosystems.net-80=!DHrH53oKfz3YHEsEdKzHuTxiWd+ak6uA3C+dv7oHRDuEk+ScE0MCf7DPAzLTCmE+GApsIOFM2GKufYk=; anonymousUserId=24EE2F84-55B5-4F94-861E-33C4EB770DC6; idamUserIdToken=eyJhbGciOiJSUzI1NiIsInR5cCI6IkpXVCIsImtpZCI6IktfYWE1NTAxNWEtMjA2YS0xMWVkLTk4ZDUtZTJjYzEyYjBkYzUwIn0.eyJleHAiOjE2NjgwODQxMjIsImlhdCI6MTY2ODA4MjMyMiwiYXVkIjoiQlRFWCIsInNlc3Npb25fc3RhdGUiOiJPTnJweFVhOG12WHRJeDR0c3pIZ09GR296WHUyeHZVVzVvNnc3eW1lLUdZLnJMRU1EWGFGIiwiaXNzIjoiaHR0cHM6Ly9pZGFtLm1ldHJvLmJnIiwiZW1haWwiOiJvZmZpY2VAdGVydmlvbi5iZyIsIm5vbmNlIjoiZjg3ZDMyYzEyYTRkNDY1ZGEzYjQwMTQ3OTlkYzc4NzMiLCJjX2hhc2giOiIiLCJzdWIiOiJVX2Y0MjBhY2E4LWY2OTMtNGMxNS1iOTIzLTc1NWY5NTc3ZTIwMCIsImF0X2hhc2giOiJlbkFGRFNJdUdmV0wzNnZ0UnJEQ253IiwicmVhbG0iOiJTU09fQ1VTVF9CRyIsImF1dGhfdGltZSI6MTY2ODA4MjMyMiwiYW1yIjpbIlVTRVJfQ1JFREVOVElBTFMiXX0.AC9vccz5PBe0d2uD6tHV5KdQ8_zbZvdARGUqo5s8KpJ0bGw97vm3xadF5TTHBUwkXX3oyJsbygC1tKvQInycU-zE0sqycIDtjP_hAGf6tUG-VV5xvtRsxBkacTBMy8OmbNHi5oncko7-dZ_tSOzQwSclLZKgKaqBcCqPBQVF0ug4pvbbqyZcw6D-MH6_T5prF7ppyqY11w9Ps_c7pFCciFR965gsO3Q-zr8CjKq1qGJeEpBFMKF0vfwinrc4wDpC5zd0Vgyf4ophzo6JkzA8TiWOGou5Z0khIpl435qUzxzt-WPFwPsPefhg_X9fYHma_OqQIpNjnV2tQwHqBD1qMTGXijtfOFQ; USER_TYPE=CUST; compressedJWT=eNpVUtlyozAQ/CJvcdrhEZtLGGGbQ4BeUlwGiTMhMeCvX5Fkt3YfVKrqme6eaalc7Tozc3Ihth8+Ae8SW/lVrvaYi3AD7Vx0eRwD4pzsuohuG3YsIm/EkUyxDybQjVzqgz2gqnBrj0ZpthNEtzUNqjWl3uqb4xkxA8Z/FhHY+ATHHld+cdFnYbZcZqIPpsflK9PpsBbw4LvfVFYcsh6LzdLJfGbOE+hR8B9ObOmG4FTqLgz4InCs+hhw81Q0BnQsHIQGmBLe3TR/7nzC7fHqmBh6uuIDMpMCuVwm2u2Xf2NbngbWDc9NQ85MpcYnhvcfOejtB5s1B3TMQefyueg9sgit8QlM8cnmc1P+rlF9hpq+QE2dIQUipMnTDRiPLBuvzjtvyISlwbF9KSKe5WH/8Izvnt5rE6FGuYDWsFMmjOa/+zMfLmWegYkEHC0/PO+P9qPYcuzbb5ztwvqVr1061LHzTHX8yDu33XbCnTHlQsgydcesK5iPO2JBvmbk3xpmH6RtNt00YnNQXXBpNV+0UIYU8lCD2ztKOdODQSJcNFVyg2aF60zS2GVvjvQk9lpAh8WliQS1aoVPwPJQn/fbr0vdxRiDJLh7d8pJhzVeNIW+75QK7H0zFVp9Z3BeGmZlA17s5LAcHDgjmc8vO/QiqorcSOenYVEx0/HJATQIqDJxAS7qsKnGQqrrXf5qNaf9GyRl3emruki8vxg0It5IhsxSfI8lGkvl+72qsoNMjhUp75xzR7NRq83w0Pp6oRqg74eq65zPaD/H9TX6GIyDfmFccfA8/fVtkPe7y5AUosA+fpZWBO0l9QzSZIfuoeG2n8aJNKG0WMfoap2XOcVJKT0ex9ep0m9vZv0gJwkqKue+Xb0TZ0Bjz+HMqi9W6Z81h+8PCaRZTJtoFYOun46FkQiPyFmGF65/VX33RdKl+ZYcXDvs7/Nv6PdLkg==; SES2_customerAdr_1001100022726355={%22addressId%22:%221001100022726355994-AD0532EI%22%2C%22addressHash%22:%221001100022726355994-AD0532EI%22%2C%22storeId%22:%2200022%22}; SES2_customerAdr_={%22addressId%22:null%2C%22addressHash%22:null%2C%22storeId%22:%2200022%22}; UserSettings=SelectedStore=1b1fc6ac-2ad6-4243-806e-a4a28c96dff4&SelectedAddress=1001100022726355994-ad0532ei",
    "Sec-Fetch-Dest": "empty",
    "Sec-Fetch-Mode": "cors",
    "Sec-Fetch-Site": "same-origin"
}

response = requests.request("GET", url, headers=headers, params=querystring)

print(response.text)

`

So I'm stuck. Any help or ideas will be greatly appreciated.

Peter
  • 1
  • 2

1 Answers1

0

The page you're navigating to shows this on a GET request:

HTTP ERROR 401 You must provide a http header 'JWT'

This means that this page requires a level of authorization to be accessed.

See JWTs.

"Authorization: This is the most common scenario for using JWT. Once the user is logged in, each subsequent request will include the JWT, allowing the user to access routes, services, and resources that are permitted with that token."

You can access the root page just fine, but once you navigate to more user specific pages or "routes", you will need to provide a JWT to access that page's content.

There is a way to get past this using scraping. You will have to log in to the site as a user using your scraper and collect the JWT which is created by the server and sent back to your client. Then use that JWT in your request's headers:

token = "randomjwtgibberishshahdahdahwdwa"
HEADERS = {
    "Authorization": "Bearer " + token
}
  • I fount the JWT and added token = "EXAMPLE JWT" as well "Authorization": "Bearer " + "EXAMPLE JWT" in HEADERS but still got 401 – Peter Nov 12 '22 at 19:38
  • @Peter you need to specify the token that is sent back to your client when you log in. You cannot use a string you come up with. Please research JWTs further. Let me know if you have any questions. – Jeremiah Gavin May 09 '23 at 17:10