I am scraping data from trip.com, a hotel listing website. After I enter the details and click the search button, the search results open in a new tab and are generated dynamically. When I scroll down, more results are downloaded and displayed. As I understand it, to scrape this dynamically generated data I need to know the headers of the API request that returns the JSON. The problem is that this site generates one of its header parameters dynamically, and in an encrypted format as well. To illustrate, this is my request URL:

Request URL: https://www.trip.com/restapi/soa2/16709/json/rateplan?testab=ec23b14de9ad450c7b74612efc288bfdd523314036afe19b5fe135f206284aab

and this is my request header:

:authority: www.trip.com
:method: POST
:path: /restapi/soa2/16709/json/rateplan?testab=ec23b14de9ad450c7b74612efc288bfdd523314036afe19b5fe135f206284aab
:scheme: https
accept: application/json
accept-encoding: gzip, deflate, br
accept-language: en-GB,en-US;q=0.9,en;q=0.8
cache-control: no-cache
content-length: 1697
content-type: application/json
cookie: ibulanguage=EN; cookiePricesDisplayed=USD; ibu_online_home_language_match={"isFromTWNotZh":false,"isFromIPRedirect":false,"isFromLastVisited":false,"isRedirect":false,"isShowSuggestion":false,"lastVisited":""}; _abtest_userid=55c19cf3-dcd6-4f4a-bfba-5965c52ac66c; _tp_search_latest_channel_name=hotels; _RF1=45.115.185.74; _RSG=BJ4Q9HdNV80BpEgEyf8ZZ9; _RDG=286d5feba1bdad2eee089fc228174f22ec; _RGUID=021f5e74-4968-44cb-98e3-229f0ea8eccb; ibulocale=en_us; g_state={"i_p":1600591022929,"i_l":3}; Union=AllianceID=1078337&SID=2036545&OUID=ctag.hash.d23ecf76442c&SourceID=&AppID=&OpenID=&Expires=1602581159329&createtime=1599989159; IBU_TRANCE_LOG_URL=/hotels/mumbai-hotel-detail-762871/grand-hyatt-mumbai/?checkIn=2020-09-14&checkOut=2020-09-15&cityId=724&adult=2&children=0&ages=&crn=1&travelpurpose=0&curr=USD&showtotalamt=0&hoteluniquekey=H4sIAAAAAAAAAOPaycjFK8Fk8B8GGIWYOBilFjNyfJl7U12Iy9DE0sTczNzQwMhgCrNFs44jAwgcaHDwBDMKWh0CeCYxSnKCeef3OAiC6AbVnQ5OrBxr_SRYZjB-P663gpFxIyNEY5LDDkamE4x-C5j-PnnDvIuJleM1uwTTISA9SVCC5RQTwyUmhltMDI-YGF4xMXxiYvgFVdHEzNDFzDCJGaJuFjPDImYGIRaQG6UUjMxTjI0NE00tzYzMTSwT00B0qplJYpKxUXKiuaW5ArdG16GPv1iNGKyYpRjdPBiD2Iwd3SyMXKJkuJg9_YIE4xpqS16d2m4vxRwa7KKoqyj_JSdM2iGJNTVPNyIi4x1LAWMXI5MA4yRGTo7m3U8-Mp5gTAYA1R43aDgBAAA(; librauuid=3lSNuDO18464CG5a; intl_ht1=h4%3D724_762871; hotel=762871; hotelhst=1164390341; _bfa=1.1599889636407.b231b.1.1599996200640.1600004365027.18.57; _bfs=1.1; _bfi=p1%3D10320668147%26p2%3D10320668147%26v1%3D57%26v2%3D56; IBU_TRANCE_LOG_P=22266407054
origin: https://www.trip.com
p: 22266407054
pid: 584e7499-4df6-45dd-8242-94cb5dec36c5
pragma: no-cache
referer: https://www.trip.com/hotels/mumbai-hotel-detail-762871/grand-hyatt-mumbai/?checkIn=2020-09-14&checkOut=2020-09-15&cityId=724&adult=2&children=0&ages=&crn=1&travelpurpose=0&curr=USD&showtotalamt=0&hoteluniquekey=H4sIAAAAAAAAAOPaycjFK8Fk8B8GGIWYOBilFjNyfJl7U12Iy9DE0sTczNzQwMhgCrNFs44jAwgcaHDwBDMKWh0CeCYxSnKCeef3OAiC6AbVnQ5OrBxr_SRYZjB-P663gpFxIyNEY5LDDkamE4x-C5j-PnnDvIuJleM1uwTTISA9SVCC5RQTwyUmhltMDI-YGF4xMXxiYvgFVdHEzNDFzDCJGaJuFjPDImYGIRaQG6UUjMxTjI0NE00tzYzMTSwT00B0qplJYpKxUXKiuaW5ArdG16GPv1iNGKyYpRjdPBiD2Iwd3SyMXKJkuJg9_YIE4xpqS16d2m4vxRwa7KKoqyj_JSdM2iGJNTVPNyIi4x1LAWMXI5MA4yRGTo7m3U8-Mp5gTAYA1R43aDgBAAA(
sec-fetch-dest: empty
sec-fetch-mode: cors
sec-fetch-site: same-origin
user-agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/85.0.4183.102 Safari/537.36

Here the value of the testab parameter is generated dynamically when I scroll down the site, but I cannot work out how it is generated. Is it produced by encrypting the rest of the request header information? FYI, I have all the request header information except the "path" value. If the value is generated by encryption, how do I proceed with scraping this? Also, I cannot use Selenium or any other browser-based scraping here.

1 Answer

The testab value is being generated at random using the following JavaScript in the file https://ak-s.tripcdn.com/modules/ibu/ibu-hotel-online/smart/smart.353046e23a610af9fcf9.js

            key: "gencb",
            value: function gencb(r) {
                var o = function() {
                    for (var e = "qwertyuiopasdfg$hjklzxcvbnmQWERTYUIOPASDFGHJKLZXCVBNM", t = "", n = 0; n < 10; n++)
                        t += e.charAt(~~(Math.random() * e.length));
                    return t
                }();
                return window[o] = function(e) {
                    delete window[o];
                    var t = e()
                      , n = "?";
                    r.realUrl && 0 < r.realUrl.indexOf("?") && (n = "&"),
                    r.realUrl += n + "testab=" + encodeURIComponent(t)
                }
                ,
                o
            }
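
To make the snippet easier to follow, here is a rough Python paraphrase of its two steps. In the JavaScript, the ten random characters become the name of a temporary function on `window`, and `testab` is set to the URL-encoded return value of the function `e` that is handed to it. The names `random_callback_name` and `append_testab` are made up for this sketch, and the `quote` call only approximates `encodeURIComponent`.

```python
import random
from urllib.parse import quote

# Alphabet copied from the gencb snippet above.
ALPHABET = "qwertyuiopasdfg$hjklzxcvbnmQWERTYUIOPASDFGHJKLZXCVBNM"


def random_callback_name(length=10, rng=random):
    """Mimic the inner loop: pick `length` random characters from ALPHABET."""
    return "".join(rng.choice(ALPHABET) for _ in range(length))


def append_testab(real_url, token):
    """Mimic the callback body: append testab=<token> to the URL."""
    # The JavaScript switches to "&" only when "?" is found at an index > 0.
    separator = "&" if real_url and real_url.find("?") > 0 else "?"
    # This safe set leaves the same characters unescaped as encodeURIComponent.
    return real_url + separator + "testab=" + quote(token, safe="-_.!~*'()")
```

The sketch does not reproduce how the token itself is computed; that happens inside the function the site passes to the callback.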

It is then sent to the server in a POST request to https://www.trip.com/restapi/soa2/16709/json/getHotelScript, but it is encrypted in the hotelUuidKey. Unless you are able to crack the encryption, you had better render the page using JavaScript.

You say you can't use Selenium or any browser-based solution. Have you looked at PyQt?

From https://doc.qt.io/qt-5/qtwebengine-overview.html#qt-webengine-core-module: "The Qt WebEngine core is based on the Chromium Project. Chromium provides its own network and painting engines and is developed tightly together with its dependent modules. Note: Qt WebEngine is based on Chromium, but does not contain or use any services or add-ons that might be part of the Chrome browser that is built and delivered by Google."

import sys
from PyQt5.QtCore import QUrl
from PyQt5.QtWidgets import QApplication
from PyQt5.QtWebEngineCore import QWebEngineUrlRequestInterceptor
from PyQt5.QtWebEngineWidgets import QWebEngineView, QWebEnginePage, QWebEngineProfile


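class WebEngineUrlRequestInterceptor(QWebEngineUrlRequestInterceptor):
    def interceptRequest(self, info):
        # Called for requests made through the profile; watch for the rateplan call.
        if info.requestUrl().url().startswith('https://www.trip.com/restapi/soa2/16709/json/rateplan?testab='):
            # The full URL, including the generated testab value.
            print(info.requestUrl().url())
            # Do stuff
            # Stop the application after the first match.
            sys.exit()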


class MyWebEnginePage(QWebEnginePage):
    def acceptNavigationRequest(self, url, _type, isMainFrame):
        return QWebEnginePage.acceptNavigationRequest(self, url, _type, isMainFrame)


if __name__ == "__main__":
    app = QApplication(sys.argv)
    browser = QWebEngineView()
    interceptor = WebEngineUrlRequestInterceptor()
    profile = QWebEngineProfile()
    profile.setRequestInterceptor(interceptor)
    page = MyWebEnginePage(profile, browser)
    url = 'https://www.trip.com/hotels/mumbai-hotel-detail-762871/grand-hyatt-mumbai/?checkIn=2020-09-14&checkOut=2020-09-15&cityId=724&adult=2&children=0&ages=&crn=1&travelpurpose=0&curr=USD&showtotalamt=0&hoteluniquekey=H4sIAAAAAAAAAOPaycjFK8Fk8B8GGIWYOBilFjNyfJl7U12Iy9DE0sTczNzQwMhgCrNFs44jAwgcaHDwBDMKWh0CeCYxSnKCeef3OAiC6AbVnQ5OrBxr_SRYZjB-P663gpFxIyNEY5LDDkamE4x-C5j-PnnDvIuJleM1uwTTISA9SVCC5RQTwyUmhltMDI-YGF4xMXxiYvgFVdHEzNDFzDCJGaJuFjPDImYGIRaQG6UUjMxTjI0NE00tzYzMTSwT00B0qplJYpKxUXKiuaW5ArdG16GPv1iNGKyYpRjdPBiD2Iwd3SyMXKJkuJg9_YIE4xpqS16d2m4vxRwa7KKoqyj_JSdM2iGJNTVPNyIi4x1LAWMXI5MA4yRGTo7m3U8-Mp5gTAYA1R43aDgBAAA('
    page.setUrl(QUrl(url))
    browser.setPage(page)
    browser.show()
    sys.exit(app.exec_())

Adapted from https://stackoverflow.com/a/50786759/839338 authored by eyllanesc

This prints the link (and a few warnings), e.g.:

https://www.trip.com/restapi/soa2/16709/json/rateplan?testab=15feb5b1067d2e4e2b979fe97830d884c5e3a07*e145f7¼(5955400Z380ac6a6
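
If only the `testab` value is needed rather than the whole link, it can be pulled out of the intercepted URL with the standard library. This is a minimal sketch; `extract_testab` is a name chosen for illustration and is not part of the code above.

```python
from urllib.parse import parse_qs, urlsplit


def extract_testab(url):
    """Return the testab query value from a URL, or None if it is absent."""
    # parse_qs maps each query key to a list of (percent-decoded) values.
    values = parse_qs(urlsplit(url).query).get("testab")
    return values[0] if values else None
```

Inside `interceptRequest` it could be called as `extract_testab(info.requestUrl().url())`.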

Updated in response to a comment: you just need to grab the cookies and make a request. Very quick and dirty code is below.

import requests
import sys
from PyQt5.QtCore import QUrl
from PyQt5.QtWidgets import QApplication
from PyQt5.QtWebEngineCore import QWebEngineUrlRequestInterceptor
from PyQt5.QtWebEngineWidgets import QWebEngineView, QWebEnginePage, QWebEngineProfile
from PyQt5.QtNetwork import QNetworkCookie


class WebEngineUrlRequestInterceptor(QWebEngineUrlRequestInterceptor):
    def __init__(self, on_network_call):
        super().__init__()
        self.on_network_call = on_network_call

    def interceptRequest(self, info):
        if info.requestUrl().url().startswith('https://www.trip.com/restapi/soa2/16709/json/rateplan?testab='):
            self.on_network_call(info)
            sys.exit()


class MyWebEnginePage(QWebEnginePage):
    def acceptNavigationRequest(self, url, _type, isMainFrame):
        return QWebEnginePage.acceptNavigationRequest(self, url, _type, isMainFrame)


def on_network_call(info):
    print(info.requestUrl().url())
    headers = {
        'authority': 'www.trip.com',
        'pragma': 'no-cache',
        'cache-control': 'no-cache',
        'accept': 'application/json',
        'dnt': '1',
        'p': '99783168614',
        'pid': '256f8038-1c06-4173-99b5-880dc120042f',
        'user-agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/85.0.4183.102 Safari/537.36',
        'content-type': 'application/json',
        'origin': 'https://www.trip.com',
        'sec-fetch-site': 'same-origin',
        'sec-fetch-mode': 'cors',
        'sec-fetch-dest': 'empty',
        'referer': 'https://www.trip.com/hotels/mumbai-hotel-detail-762871/grand-hyatt-mumbai/?checkIn=2020-09-14&checkOut=2020-09-15&cityId=724&adult=2&children=0&ages=&crn=1&travelpurpose=0&curr=USD&showtotalamt=0&hoteluniquekey=H4sIAAAAAAAAAOPaycjFK8Fk8B8GGIWYOBilFjNyfJl7U12Iy9DE0sTczNzQwMhgCrNFs44jAwgcaHDwBDMKWh0CeCYxSnKCeef3OAiC6AbVnQ5OrBxr_SRYZjB-P663gpFxIyNEY5LDDkamE4x-C5j-PnnDvIuJleM1uwTTISA9SVCC5RQTwyUmhltMDI-YGF4xMXxiYvgFVdHEzNDFzDCJGaJuFjPDImYGIRaQG6UUjMxTjI0NE00tzYzMTSwT00B0qplJYpKxUXKiuaW5ArdG16GPv1iNGKyYpRjdPBiD2Iwd3SyMXKJkuJg9_YIE4xpqS16d2m4vxRwa7KKoqyj_JSdM2iGJNTVPNyIi4x1LAWMXI5MA4yRGTo7m3U8-Mp5gTAYA1R43aDgBAAA(sec-fetch-dest:%20empty',
        'accept-language': 'en-GB,en-US;q=0.9,en;q=0.8',
    }

    data = '{"checkIn":"2020-09-15","checkOut":"2020-09-16","priceType":"0","adult":2,"popularFacilityType":"","hotelUniqueKey":"H4sIAAAAAAAAAOPaycjFK8Fk8B8GGIWYOBilFjNyfJl7U12Iy9DE0sTczNzQwMhgCrNFs44jAwgcaHDwBDMKWh0CeCYxSnKCeef3OAiC6AbVnQ5OrBxr_SRYZjB-P663gpFxIyNEY5LDDkamE4x-C5j-PnnDvIuJleM1uwTTISA9SVCC5RQTwyUmhltMDI-YGF4xMXxiYvgFVdHEzNDFzDCJGaJuFjPDImYGIRaQG6UUjMxTjI0NE00tzYzMTSwT00B0qplJYpKxUXKiuaW5ArdG16GPv1iNGKyYpRjdPBiD2Iwd3SyMXKJkuJg9_YIE4xpqS16d2m4vxRwa7KKoqyj_JSdM2iGJNTVPNyIi4x1LAWMXI5MA4yRGTo7m3U8-Mp5gTAYA1R43aDgBAAA(sec-fetch-dest:%20empty","child":0,"roomNum":1,"masterHotelId":762871,"age":"","cityId":"724","hotel":"762871","versionControl":[{"key":"RoomCardVersionB","value":"T"}],"signInRoomKey":"","signInType":0,"filterCondition":null,"unAvailableRoomInfo":null,"minPriceRoomKey":"","Head":{"Locale":"en-XX","Currency":"USD","AID":"","SID":"","ClientID":"1600039009299.2v21ry","OUID":"","CAID":"","CSID":"","COUID":"","TimeZone":"1","PageID":"10320668147","HotelExtension":{"WebpSupport":true,"Qid":"","hasAidInUrl":false,"group":"TRIP","PID":"256f8038-1c06-4173-99b5-880dc120042f","hotelUuidKey":"S96K39i7Te47IA7idYlfYp6E3YLpemawnOWOYhgjs6wZFv0lEPYtNjoSwHSybpjsY1pKL4KazvlLjFYoTvU1YQByTZjUBvc9ed7YG9jHZy5Y1fekTv0NEghwGqWbsenZi8BwMYtY5OInLeo9YmDvFSeDrNbeUZjnkwDfY7bwzSEkY1dRSYX0INbWBYaqYonikdikSiXNj5Y5bjSQi4gYBkwPoJoGRcaYT7woY0ZR7fwa7W6XW4hR7BRqpJT4JMfy9SEcbRgaE4ZEaY4FyfQK11xomETtvc1KQtY3aWGBr90yBXET9vSOvhkyg1E9DJGYUaRkNwG3W9fW6QWf7iDOv5DWqbWFHvfSYHdvdtvOYaXjOcwLkvthjUYAqR9ZwqdjAHW53eZPROqWzSJ3PWPYPnRgqwmFW43jDSePDRBPWtcY3niTYHpRqLwUgWz6WPURD1RUZJ8bJ73ytTEFlWGmW6G","hotelUuid":"dhX4uhn0MdpHusaD"},"Frontend":{"vid":"1600039009299.2v21ry","sessionID":2,"pvid":6},"P":"99783168614","Device":"PC","Version":"0"}}'

    r = requests.post(info.requestUrl().url(), cookies=to_cookie_dict(), data=data, headers=headers)
    print(r.json())


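def on_cookie_added(cookie):
    # Store each cookie the browser receives, skipping ones already stored.
    for c in cookies:
        if c.hasSameIdentifier(cookie):
            return
    cookies.append(QNetworkCookie(cookie))


def to_cookie_dict():
    # Convert the stored QNetworkCookie objects into a plain dict for requests.
    cookie_dict = {}
    for c in cookies:
        # name() and value() return QByteArray, so decode them to str.
        cookie_dict[bytearray(c.name()).decode()] = bytearray(c.value()).decode()
    print(cookie_dict)
    return cookie_dict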


if __name__ == "__main__":
    app = QApplication(sys.argv)
    browser = QWebEngineView()
    interceptor = WebEngineUrlRequestInterceptor(on_network_call)
    profile = QWebEngineProfile()
    cookie_store = profile.cookieStore()
    cookie_store.cookieAdded.connect(on_cookie_added)
    cookies = []
    profile.setRequestInterceptor(interceptor)
    page = MyWebEnginePage(profile, browser)
    url = 'https://www.trip.com/hotels/mumbai-hotel-detail-762871/grand-hyatt-mumbai/?checkIn=2020-09-14&checkOut=2020-09-15&cityId=724&adult=2&children=0&ages=&crn=1&travelpurpose=0&curr=USD&showtotalamt=0&hoteluniquekey=H4sIAAAAAAAAAOPaycjFK8Fk8B8GGIWYOBilFjNyfJl7U12Iy9DE0sTczNzQwMhgCrNFs44jAwgcaHDwBDMKWh0CeCYxSnKCeef3OAiC6AbVnQ5OrBxr_SRYZjB-P663gpFxIyNEY5LDDkamE4x-C5j-PnnDvIuJleM1uwTTISA9SVCC5RQTwyUmhltMDI-YGF4xMXxiYvgFVdHEzNDFzDCJGaJuFjPDImYGIRaQG6UUjMxTjI0NE00tzYzMTSwT00B0qplJYpKxUXKiuaW5ArdG16GPv1iNGKyYpRjdPBiD2Iwd3SyMXKJkuJg9_YIE4xpqS16d2m4vxRwa7KKoqyj_JSdM2iGJNTVPNyIi4x1LAWMXI5MA4yRGTo7m3U8-Mp5gTAYA1R43aDgBAAA('
    page.setUrl(QUrl(url))
    browser.setPage(page)
    browser.show()
    sys.exit(app.exec_())

Thanks to "How to capture the response of a request intercepted by QWebEngineUrlRequestInterceptor?" authored by eriel marimon and https://stackoverflow.com/a/48154459/839338 authored by eyllanesc
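
The request body in `on_network_call` hard-codes the stay dates, and the `p` and `pid` headers are captured values that, according to the follow-up comments, can be left out. A small helper along these lines could refresh both before posting. It is only a sketch: `refresh_request` is a hypothetical name, and it assumes the body keeps its top-level `checkIn` and `checkOut` keys.

```python
import json
from datetime import date, timedelta


def refresh_request(data, headers, check_in, nights=1):
    """Return (body, headers) with fresh dates and without p/pid headers."""
    payload = json.loads(data)
    # Overwrite the hard-coded stay dates with ISO formatted ones.
    payload["checkIn"] = check_in.isoformat()
    payload["checkOut"] = (check_in + timedelta(days=nights)).isoformat()
    # Drop the headers that change between sessions.
    trimmed = {k: v for k, v in headers.items() if k.lower() not in ("p", "pid")}
    return json.dumps(payload), trimmed
```

For example, `data, headers = refresh_request(data, headers, date.today() + timedelta(days=1))` just before the `requests.post` call would request tomorrow night.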

Dan-Dev
  • Thanks a lot for the response. The answer is very useful and solves half of my problem, which was getting the URL. The issue is that I cannot use this URL again to get the response from the server. So what I essentially want to know is how to get the response returned for this request URL. – Aman Mishra Sep 14 '20 at 19:03
  • I have added very quick and dirty code to address your problem. – Dan-Dev Sep 14 '20 at 20:43
  • I just noticed the `p` and `pid` headers change, but it seems you can leave them out and it still works. Remember to change the dates if you run it after today. – Dan-Dev Sep 14 '20 at 20:57
  • Damn bro, you are awesome, thanks a lot!!! I just need one last piece of help. This page is dynamically generated, i.e. when you scroll down more and more data is loaded, so how do I incorporate this behaviour of scrolling down to the end of the page using PyQt? – Aman Mishra Sep 15 '20 at 06:56
  • That's really another one or two questions. First of all, you will need to remove `sys.exit()` from `interceptRequest()`. This will let the page continue to load, but it will encounter a timeout on one of the URLs, which you will have to fix. Then, after the page has loaded, you will need to run JavaScript on the page to scroll it down. I would post these as separate questions if I were you, and maybe get an answer to the timeout problem first before posting the second. Make sure these questions have not been asked already though, or they will get marked as duplicates. – Dan-Dev Sep 15 '20 at 09:59
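
The scrolling step from the last comment could be sketched as below. This is an untested outline, not a known-working solution for trip.com: the class name `ScrollToEnd`, the delay, the round limit and the "stop when the page height no longer grows" rule are all assumptions, and the timeout problem mentioned in the comment is not addressed. It only relies on the page offering `runJavaScript(script, callback)`, which `QWebEnginePage` does, and on a scheduler with the shape of `QTimer.singleShot(msec, callable)`.

```python
# JavaScript that scrolls to the bottom and evaluates to the page height.
SCROLL_JS = "window.scrollTo(0, document.body.scrollHeight); document.body.scrollHeight;"


class ScrollToEnd:
    """Scroll repeatedly until the page height stops growing (assumed stop rule)."""

    def __init__(self, page, schedule, max_rounds=20, delay_ms=1500):
        self.page = page            # object with runJavaScript(script, callback)
        self.schedule = schedule    # e.g. QTimer.singleShot
        self.max_rounds = max_rounds
        self.delay_ms = delay_ms
        self.last_height = None
        self.rounds = 0
        self.finished = False

    def start(self):
        # Scroll once and receive the resulting height in _on_height.
        self.page.runJavaScript(SCROLL_JS, self._on_height)

    def _on_height(self, height):
        self.rounds += 1
        if height == self.last_height or self.rounds >= self.max_rounds:
            self.finished = True
            return
        self.last_height = height
        # Give the site time to load the next batch before scrolling again.
        self.schedule(self.delay_ms, self.start)
```

With PyQt5 this might be wired up as `scroller = ScrollToEnd(page, QTimer.singleShot)` followed by `page.loadFinished.connect(lambda ok: scroller.start())`, after removing `sys.exit()` from `interceptRequest()`.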