5

I am trying to scrape the website "https://www.ticketweb.com/search?q=". Even though I can see the HTML elements in the browser inspector and can download the webpage there, when I request it via Python I only get the error shown below.

Here is what I have in my script:

import requests

url_path = r'https://www.ticketweb.com/search?q='

HEADERS = {
    "Accept": "*/*",
    "Accept-Encoding": "utf-8",
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/109.0.0.0 Safari/537.36"
}

response = requests.get(url_path, headers=HEADERS)

content = response.text

print(content)

Here is the response:

<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
 "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html>
  <head>
    <title>506 Invalid request</title>
  </head>
  <body>
    <h1>Error 506 Invalid request</h1>
    <p>Invalid request</p>
    <h3>Error 54113</h3>
    <p>Details: cache-dfw-kdfw8210093-DFW 1678372070 120734701</p>
    <hr>
    <p>Varnish cache server</p>
  </body>
</html>
General Grievance

3 Answers

4

Whenever you see a 506, rest assured that the issue is with the client you are using: the server refuses to handle your request. You are using requests, which sends a plain, native HTTP request, while the server end screens requests based on a specific TLS/JA3 pattern, so that is what you have to sort out.

For instance, calling https://tls.browserleaks.com/json will give a different JA3 for Selenium than for requests; the reason behind that is TLS.
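
As a quick check, here is a minimal sketch of that comparison (note: the exact JSON field names, like "ja3_hash", are assumptions; print the full response to confirm what the endpoint actually returns):

import requests

# Ask browserleaks which JA3 fingerprint this client presents.
# The "ja3_hash" / "ja3_text" keys are assumptions; dump the whole
# JSON to see the actual field names.
fingerprint = requests.get("https://tls.browserleaks.com/json").json()
print(fingerprint.get("ja3_hash"))
print(fingerprint.get("ja3_text"))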

You have to use a TLS client for this, since JA3 is playing a game here within the ciphers. Otherwise, pin requests to TLS v1.2 and make some modifications to the cipher suite.

In addition, you can use curl-cffi as well: https://pypi.org/project/curl-cffi/

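A minimal curl-cffi sketch (the "chrome110" impersonation profile is one choice among the profiles the library ships; any supported profile works):

from curl_cffi import requests as curl_requests

# curl-cffi's impersonate mode reproduces a real browser's TLS/JA3
# fingerprint instead of the default libcurl one.
r = curl_requests.get(
    "https://www.ticketweb.com/venue/the-new-parish-oakland-ca/428995",
    impersonate="chrome110",
)
print(r.status_code)

And with tls_client: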

import tls_client


headers = {
    'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,*/*;q=0.8',
    'accept-language': 'en-US,en;q=0.5',
}


def main():
    # Create a session that presents Firefox 113's TLS fingerprint.
    req = tls_client.Session(client_identifier="firefox113")
    req.headers.update(headers)
    params = {
        'page': 1
    }
    r = req.get(
        "https://www.ticketweb.com/venue/the-new-parish-oakland-ca/428995", params=params)
    print(r)  # the Response repr shows the status code


if __name__ == "__main__":
    main()

Output:

200

You should have the full response body in r.text.

The same can be done with plain requests:

import ssl
import requests

from requests.adapters import HTTPAdapter
from urllib3.poolmanager import PoolManager
from urllib3.util.ssl_ import create_urllib3_context

CIPHERS = "ECDHE-RSA-AES256-GCM-SHA384:ECDHE-ECDSA-AES256-GCM-SHA384"


headers = {
    'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,*/*;q=0.8',
    'accept-language': 'en-US,en;q=0.5',
}


class TlsAdapter(HTTPAdapter):
    # A transport adapter that forces custom ciphers and TLS options,
    # which changes the TLS fingerprint that requests presents.
    def __init__(self, ssl_options=0, **kwargs):
        self.ssl_options = ssl_options
        super(TlsAdapter, self).__init__(**kwargs)

    def init_poolmanager(self, *pool_args, **pool_kwargs):
        # Build an SSL context restricted to the cipher list above.
        ctx = create_urllib3_context(
            ciphers=CIPHERS, cert_reqs=ssl.CERT_REQUIRED, options=self.ssl_options)
        self.poolmanager = PoolManager(
            *pool_args, ssl_context=ctx, **pool_kwargs)


def main():
    # Disable TLS 1.0 and 1.1 so the handshake negotiates TLS 1.2 or newer.
    adapter = TlsAdapter(ssl.OP_NO_TLSv1 | ssl.OP_NO_TLSv1_1)
    with requests.session() as req:
        req.mount("https://", adapter)
        req.headers.update(headers)
        params = {
            'page': 1
        }
        r = req.get(
            "https://www.ticketweb.com/venue/the-new-parish-oakland-ca/428995", params=params)
        print(r)


if __name__ == "__main__":
    main()

Output:

200
  • 1
    Your answer seems really interesting, but I confess it is not very clear to me. Any chance you can detail / improve it a bit? I'm pretty sure what you're describing is of huge interest for all the apprentice scrapers out there! – R_D May 27 '23 at 16:32
  • 1
    @R_D Basically, what he's saying is that you can impersonate browsers with JA3 fingerprinting. JA3 is a standard for creating SSL client fingerprints that can be used to identify the client application that initiated a TLS connection. JA3 fingerprints the way that a client application communicates over TLS and JA3S fingerprints the server response. Combined, they essentially create a fingerprint of the cryptographic negotiation between client and server – baduker May 27 '23 at 17:42
    How do you explain the behavior that plain requests worked at first, then only modifying the request headers was enough, and now impersonation techniques are needed? And what about the site's terms of use, which actually prohibit automated content extraction? – colidyre May 29 '23 at 11:43
    @colidyre What? _How do you explain the behavior that plain requests worked at first_? And who said that editing the headers is the solution?! Lastly, regarding the site's terms of use: if you follow the site's TOS, why did you post an answer? Also, web scraping is not against SO's TOS – αԋɱҽԃ αмєяιcαη May 29 '23 at 12:21
3

You can use the Google cache URL to get to the site.

https://webcache.googleusercontent.com/search?q=cache:

Also, the data for the venue and event sits in a single <script> element and can be easily parsed.

The link I've used: https://www.ticketweb.com/venue/the-new-parish-oakland-ca/428995?page=1

For example:

import json
from datetime import datetime

import pandas as pd
import requests
from bs4 import BeautifulSoup
from tabulate import tabulate

HEADERS = {
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
                  "AppleWebKit/537.36 (KHTML, like Gecko) "
                  "Chrome/113.0.0.0 Safari/537.36 Edg/113.0.1774.50"
}

google_cache_url = "https://webcache.googleusercontent.com/search?q=cache:"


def parse_date(event_date: str) -> str:
    return (
        datetime
        .strptime(event_date, "%Y-%m-%dT%H:%M")
        .strftime("%Y-%m-%d at %H:%M")
    )


def show_performers(performers: list) -> str:
    return ", ".join([performer["name"] for performer in performers])


def parse_event(script_element: str) -> list:
    venue_events = json.loads(script_element)
    parsed = []
    for event in venue_events:
        parsed.append(
            [
                event["name"],
                show_performers(event["performer"]),
                parse_date(event["startDate"]),
                event["offers"]["availability"],
                event["url"],
            ]
        )
    return parsed


if __name__ == "__main__":
    ticket_web_url = "https://www.ticketweb.com/venue/the-new-parish-oakland-ca/428995?page=1"

    response = requests.get(google_cache_url + ticket_web_url, headers=HEADERS)
    # The event data lives in a single JSON-LD <script> element.
    script = (
        BeautifulSoup(response.text, "html.parser")
        .select_one("script[type='application/ld+json']")
        .string
    )

    venue_table = pd.DataFrame(
        parse_event(script),
        columns=["Event", "Performers", "When", "Status", "URL"],
    )
    print(tabulate(venue_table, headers="keys", tablefmt="psql", showindex=False))

Prints:

+-----------------------------------------------------+------------------------------------------------+---------------------+----------+-----------------------------------------------------------------------------------------------------+
| Event                                               | Performers                                     | When                | Status   | URL                                                                                                 |
|-----------------------------------------------------+------------------------------------------------+---------------------+----------+-----------------------------------------------------------------------------------------------------|
| Homixide Gang: Snot of Not Tour                     | Homixide Gang,  Sid Shyne, Biggaveli, Lil He77 | 2023-05-25 at 20:00 | SoldOut  | https://www.ticketweb.com/event/homixide-gang-snot-of-not-the-new-parish-tickets/13096395           |
| La Sonora Dinamita, Suenatron, El Dusty             | LA Sonora Dinamita, Suenatron, El Dusty        | 2023-05-27 at 21:00 | InStock  | https://www.ticketweb.com/event/la-sonora-dinamita-suenatron-el-the-new-parish-tickets/13220068     |
| Reggae Gold XL presents: The Give Thankz Reunion    | Reggae Gold XL                                 | 2023-05-28 at 21:00 | InStock  | https://www.ticketweb.com/event/reggae-gold-xl-presents-the-the-new-parish-tickets/13228028         |
| THE OFFICIAL OAKLAND CARNIVAL AFTER-PARTY           | Oakland Carnival, SambaFunk, Kenny Mann        | 2023-06-03 at 22:00 | InStock  | https://www.ticketweb.com/event/the-official-oakland-carnival-after-the-new-parish-tickets/13236848 |
| WARD DAVIS                                          | Ward Davis                                     | 2023-06-08 at 20:00 | InStock  | https://www.ticketweb.com/event/ward-davis-the-new-parish-tickets/13127855                          |
| Casey Veggies                                       | Casey Veggies                                  | 2023-06-09 at 20:30 | InStock  | https://www.ticketweb.com/event/casey-veggies-the-new-parish-tickets/13151618                       |
| Casey Veggies                                       | Casey Veggies                                  | 2023-06-09 at 21:00 | InStock  | https://www.ticketweb.com/event/casey-veggies-the-new-parish-tickets/13160998                       |
| Mortified presents: Morti-Pride!                    | MORTIFIED                                      | 2023-06-10 at 19:30 | InStock  | https://www.ticketweb.com/event/mortified-presents-morti-pride-the-new-parish-tickets/13126705      |
| ZelooperZ: Traptastic Tour                          | ZelooperZ                                      | 2023-06-13 at 20:00 | InStock  | https://www.ticketweb.com/event/zelooperz-traptastic-tour-the-new-parish-tickets/13205488           |
| Ab-Soul: The Intelligent Movement Tour              | Ab-Soul                                        | 2023-06-14 at 20:00 | InStock  | https://www.ticketweb.com/event/ab-soul-the-intelligent-movement-the-new-parish-tickets/13156258    |
| Ab-Soul: The Intelligent Movement Tour              | Ab-Soul                                        | 2023-06-15 at 20:00 | SoldOut  | https://www.ticketweb.com/event/ab-soul-the-intelligent-movement-the-new-parish-tickets/13108785    |
| THE COLORS OF BANG YONG GUK: THE US TOUR 2023       | BANG YONGGUK                                   | 2023-06-16 at 19:00 | InStock  | https://www.ticketweb.com/event/the-colors-of-bang-yong-the-new-parish-tickets/13115845             |
| THE COLORS OF BANG YONG GUK: THE US TOUR 2023       | Bang Yongguk                                   | 2023-06-16 at 19:00 | InStock  | https://www.ticketweb.com/event/the-colors-of-bang-yong-the-new-parish-tickets/13115565             |
| BashfortheWorld                                     | BashfortheWorld                                | 2023-06-17 at 21:00 | SoldOut  | https://www.ticketweb.com/event/bashfortheworld-the-new-parish-tickets/13116985                     |
| Frank Zappa Tribute with The Stinkfoot Orchestra    | The Stinkfoot Orchestra                        | 2023-06-23 at 20:00 | InStock  | https://www.ticketweb.com/event/frank-zappa-tribute-with-the-the-new-parish-tickets/13198478        |
| Hip Hop For The People's Health And Wellness Summit | Inspectah Deck                                 | 2023-06-25 at 21:00 | InStock  | https://www.ticketweb.com/event/hip-hop-for-the-peoples-the-new-parish-tickets/13161098             |
| 03 Greedo                                           | 03 Greedo                                      | 2023-06-28 at 20:00 | InStock  | https://www.ticketweb.com/event/03-greedo-the-new-parish-tickets/13155688                           |
| 03 Greedo                                           | 03 Greedo                                      | 2023-06-29 at 20:00 | InStock  | https://www.ticketweb.com/event/03-greedo-the-new-parish-tickets/13175908                           |
| K-Pop Mixtape Party                                 | Alawn                                          | 2023-07-01 at 20:30 | InStock  | https://www.ticketweb.com/event/k-pop-mixtape-party-the-new-parish-tickets/13241338                 |
| LOJAY - GANGSTER ROMANTIC                           | Lojay                                          | 2023-07-05 at 20:00 | InStock  | https://www.ticketweb.com/event/lojay-gangster-romantic-the-new-parish-tickets/13234278             |
+-----------------------------------------------------+------------------------------------------------+---------------------+----------+-----------------------------------------------------------------------------------------------------+
baduker
  • 1
    Hmm, you will never have up-to-date information that way. – αԋɱҽԃ αмєяιcαη May 27 '23 at 12:07
  • 1
    BTW, Varnish has nothing to do with Cloudflare; Cloudflare is one thing and Varnish is a different thing, but you can use both of them together. Varnish is an HTTP cache while Cloudflare is a web security service. The error raised by the cache server indicates the server is unable to serve the request. – αԋɱҽԃ αмєяιcαη May 27 '23 at 12:18
  • 1
    The site's rarely updated, so for this use-case the cache is enough, as it returns the correct data. – baduker May 27 '23 at 12:19
    I see, but that means interacting with Google, where you are going to be blocked in the long run unless you handle that with proxies and some other techniques. Because of that, the real solution is to fix your initial request to the targeted site itself. – αԋɱҽԃ αмєяιcαη May 27 '23 at 12:24
2

It seems that the request headers are being critically scrutinized. I have played a bit with them, and this, for example, was a successful request at the time of writing this answer:

import requests

url_path = r'https://www.ticketweb.com/search?q='

HEADERS = {
    "Accept-Language": "en-US,en",
    "Accept": "*/*;q=0.9",
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/109.0.0.0 Safari/537.36"
}

response = requests.get(url_path, headers=HEADERS)
response.raise_for_status()
print(response.text)

Here is a good explanation of the q parameter in request headers. tl;dr (as far as I understood it): it indicates that, as the requester, you accept the instruction not being handled quite so strictly.
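
For example, a header like

Accept: text/html;q=1.0, application/xml;q=0.9, */*;q=0.8

tells the server that HTML is preferred, XML is the next choice, and anything else is still acceptable with the lowest priority.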

I came to this solution by copying the complete request headers from a Firefox request and minimizing them as far as I could, and I also played a bit with the q parameter, as already mentioned.

EDIT: In the meantime, this request has stopped working.

Important note

If you read the terms of use on the page, you will see something like this:

[...] you agree that you will not:

  • Use any robot, spider [...]
  • Use any automated software or computer system to search for [...]

So it is very likely that the site owners are analyzing certain criteria to see whether a request is made from a browser or from a machine. If they assume that a computer program is accessing the site, they can block or manipulate the response (e.g. return an empty result, or an arbitrary status code like 506, or even 418 if they want).

That means: Web scraping can fail at any time. Especially if the site owners don't want you to download their content automatically, because site operators can always come up with new things to prevent automated access.

If you are allowed to download the content, you will have to do more work, e.g. use the Selenium web driver, handle cookies, humanize the request timing, maybe not always use the same IP address for automated access, use caches of the site, etc.

This is hard to do with the requests library alone or with plain curl. So instead of faking a human request, why not use a browser to make the request for you?

Here is an example of how to make the request via a Selenium-driven browser. This should work for the URL https://www.ticketweb.com/search?q=taylor+swift and driver.find_element(by=By.TAG_NAME, value="body"). The browser can also run headless by adding --headless to the browser options, so there is no need to see the browser UI during the process.
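
A minimal sketch of that approach (assuming Selenium 4, which downloads a matching driver itself; Chrome and the --headless=new flag are choices made here, not requirements):

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By

options = Options()
options.add_argument("--headless=new")  # newer Chrome headless mode; plain --headless also works

driver = webdriver.Chrome(options=options)
try:
    driver.get("https://www.ticketweb.com/search?q=taylor+swift")
    # The rendered page body, after the browser has executed any JavaScript.
    body = driver.find_element(by=By.TAG_NAME, value="body")
    print(body.text)
finally:
    driver.quit()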

But again: Web scraping can fail at any time and please read carefully the terms of use if you are allowed to read the page automatically at all.

BTW: utf-8 is not a valid value for the Accept-Encoding header (that header is for compression codings like gzip, not charsets). But it seems that you don't need it anyway.

colidyre
  • 1
    I get this `requests.exceptions.HTTPError: 506 Server Error: Switching protocols (100) for url: https://www.ticketweb.com/search?q=taylor+swift` – Life is complex May 26 '23 at 11:44
    Same for me, so I have updated the answer to reflect this behavior (the request worked at the time of writing the answer). I also included another way of getting the data (via Selenium), but it has to be checked whether that is allowed. – colidyre May 27 '23 at 10:44
  • 1
    The detailed information described here is really beside the main problem. – αԋɱҽԃ αмєяιcαη May 27 '23 at 12:08
  • 1
    @αԋɱҽԃαмєяιcαη I see it differently. A reference to the terms of use is appropriate in my opinion. Furthermore, I was able to show initially that changing the request headers was successful. After a certain time, it no longer was. So we have here the classic arms race of crawlers against site operators. – colidyre May 29 '23 at 08:29
    @colidyre Check my answer to understand the real issue with the site. – αԋɱҽԃ αмєяιcαη May 29 '23 at 10:11
  • 1
    I think most people spending significant time writing scrapers understand that they can fail at any time, so yes, this whole question is about how to overcome those restrictions. FWIW, using Selenium by itself does not fix the problem. – max pleaner May 31 '23 at 18:07