While scraping a website, I want to record the referer of each request that returns a 404.

def parse_item(self, response):
    if response.status == 404:
        # Do something with this referer
        referer = response.request.headers.get('Referer', None)

It kind of works, but the returned referer always looks something like:

\x68747470733a2f2f7777772e6162752d64686162692e6d657263656465732d62656e7a2d6d656e612e636f6d2f61722f70617373656e676572636172732f6d657263656465732d62656e7a2d636172732f6d6f64656c732f676c652f636f7570652d633136372f6578706c6f72652e68746d6c

This looks more like a memory address than a URL. Am I missing something here?

Thank you!

Bruno

2 Answers

A leading \x escape sequence means the next two characters are interpreted as hex digits for the character code. (See "What does a leading \x mean in a Python string \xaa".)

\x68747470733a2f2f7777772e6162752d64686162692e6d657263656465732d62656e7a2d6d656e612e636f6d2f61722f70617373656e676572636172732f6d657263656465732d62656e7a2d636172732f6d6f64656c732f676c652f636f7570652d633136372f6578706c6f72652e68746d6c

In this case there is only one \x, but what follows it is still a hex string. You can decode it to get the URL.

>>> # The leading \x needs to be removed from the string first
>>> hex_str = '68747470733a2f2f7777772e6162752d64686162692e6d657263656465732d62656e7a2d6d656e612e636f6d2f61722f70617373656e676572636172732f6d657263656465732d62656e7a2d636172732f6d6f64656c732f676c652f636f7570652d633136372f6578706c6f72652e68746d6c'
>>> bytes.fromhex(hex_str)
b'https://www.abu-dhabi.mercedes-benz-mena.com/ar/passengercars/mercedes-benz-cars/models/gle/coupe-c167/explore.html'
>>> bytes.fromhex(hex_str).decode('utf-8')
'https://www.abu-dhabi.mercedes-benz-mena.com/ar/passengercars/mercedes-benz-cars/models/gle/coupe-c167/explore.html'
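
If you have to do this for many values, the same decoding could be wrapped in a small helper. This is only a sketch: the name hex_to_text is made up for illustration, and it assumes the value is a plain string with at most one leading \x marker, as in the question.

```python
def hex_to_text(value: str, encoding: str = "utf-8") -> str:
    """Decode a hex string (optionally prefixed with a literal backslash-x) into text."""
    # Strip the single leading "\x" marker if it is present.
    if value.startswith("\\x"):
        value = value[2:]
    # bytes.fromhex turns each pair of hex digits into one byte.
    return bytes.fromhex(value).decode(encoding)


print(hex_to_text("\\x68747470733a2f2f"))  # prints: https://
```

bytes.fromhex raises ValueError if the input is not valid hex, so a malformed value fails loudly instead of producing a wrong URL.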
Y4nhu1

Thanks Yanhui, you unblocked me.

It was simpler than I was expecting:

def parse_item(self, response):
    if response.status == 404:
        # The header value is bytes, or None if the header is absent
        referer = response.request.headers.get('Referer', None)
        if referer is not None:
            referer = referer.decode('utf-8')
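
Since the header can be missing, a slightly more defensive version of that decode step might look like the sketch below. The helper name header_to_text is hypothetical, and it assumes the header value arrives as bytes or None, which is what the code above suggests.

```python
from typing import Optional


def header_to_text(value: Optional[bytes], encoding: str = "utf-8") -> Optional[str]:
    """Return a header value as text, or None when the header is absent."""
    if value is None:
        return None
    # errors="replace" keeps going on invalid bytes instead of raising.
    return value.decode(encoding, errors="replace")


print(header_to_text(b"https://example.com/page"))  # prints: https://example.com/page
print(header_to_text(None))                         # prints: None
```

Inside the callback this would be used as referer = header_to_text(response.request.headers.get('Referer')).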