I am trying to scrape Instagram with Selenium using the Chrome WebDriver. I need the XHR response data, so I tried browsermob-proxy, but the HAR it produced did not contain enough information:

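import time

from browsermobproxy import Server
from selenium import webdriver

# Start the BrowserMob Proxy server and create a proxy instance on it
server = Server("/home/doruk/Downloads/browsermob-proxy 2.1.4/bin/browsermob-proxy")
server.start()
time.sleep(1)
proxy = server.create_proxy()
time.sleep(1)

# Route Chrome's traffic through the proxy
chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument("--proxy-server={0}".format(proxy.proxy))
browser = webdriver.Chrome(options=chrome_options)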

This is an excerpt of the proxy.har output, in JSON format:
 {
    "comment": "", 
    "serverIPAddress": "155.245.9.55", 
    "pageref": "", 
    "startedDateTime": "2018-05-21T16:44:41.053+03:00", 
    "cache": {}, 
    "request": {
      "comment": "", 
      "cookies": [], 
      "url": "https://scontent-sof1-1.cdninstagram.com/vp/e95312434013bc43a5c00c458b53022cb/5BC46751/t51.2885-19/s150x150/26432586_139925760144086_726193654523232256_n.jpg", 
      "queryString": [], 
      "headers": [], 
      "headersSize": 528, 
      "bodySize": 0, 
      "method": "GET", 
      "httpVersion": "HTTP/1.1"
    }, 

When I click "Load More Comments" on a post, a request to a URL like this one shows up:

https://www.instagram.com/graphql/query/?query_hash=33ba35000cb50da46f5b5e889df7d159&variables=%7B"shortcode"%3A"Bi9ZURdA6Gn"%2C"first"%3A36%2C"after"%3A"AQBr-wP7U4Ykr1QRH7PYJ1a0KQivhS0Ndwae-5F8vrZ5sf1eA_Bfgn4dZ0ql0pwUf9GXPm_LPyhtCnlhH6YOHfuNstwXK9VZuUIR4zD3k24s6Q"%7D

I need the data that this request returns, but it does not appear in the HAR. Is there any way to capture it?

At a minimum, I need the value of the "query_hash" parameter.

Benjamin Loison
doruksahin
  • After you click the link in question, can you wait, say, 10 seconds and then export the HAR? I know it's silly, but sometimes a lot of requests are happening in the background, and you may be exporting the HAR before browsermob-proxy has captured the information you are looking for. – Satish May 21 '18 at 14:59

1 Answer

I've done it! The trick for me was simply to wait until the page had finished loading completely. The DOM ready state is not enough, because in my case the page keeps loading after that point. There is a way to replace the arbitrary sleep by asking the driver to wait until the page has really finished loading, but I do not recall the code; I would have to look it up.
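One possible way to avoid the fixed sleep, offered only as a sketch: poll the HAR and treat the page as loaded once the number of captured entries has stopped changing for a short quiet period. The function name `wait_until_har_settles` and its parameters are made up for illustration; they are not part of Selenium or browsermob-proxy. The only assumption about the HAR is the standard `log` / `entries` layout.

```python
import time


def wait_until_har_settles(get_har, quiet_period=2.0, timeout=30.0, poll=0.25):
    """Hypothetical helper: return True once the HAR entry count has stayed
    unchanged for `quiet_period` seconds, or False if `timeout` seconds pass
    first. `get_har` is any callable returning a HAR-shaped dict."""
    deadline = time.monotonic() + timeout
    last_count = -1
    last_change = time.monotonic()
    while time.monotonic() < deadline:
        count = len(get_har()["log"]["entries"])
        now = time.monotonic()
        if count != last_count:
            # New requests were captured, so restart the quiet-period clock
            last_count = count
            last_change = now
        elif now - last_change >= quiet_period:
            return True
        time.sleep(poll)
    return False
```

With a browsermob-proxy client named `proxy`, this could stand in for the fixed `time.sleep(10)` as `wait_until_har_settles(lambda: proxy.har)`. If I recall correctly, the Python client also has a `wait_for_traffic_to_stop(quiet_period, timeout)` method (both values in milliseconds) that may do the same job on the proxy side, so check that first.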

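from browsermobproxy import Server
import json
from selenium import webdriver
import time

urle = "https://www.yoururl.com"

# Start BrowserMob Proxy and create a proxy instance on it
server = Server(path="./browsermob-proxy-2.1.4/bin/browsermob-proxy")
server.start()
proxy = server.create_proxy()

# Selenium 3-style arguments (firefox_profile, executable_path);
# newer Selenium releases configure these through Options/Service objects
profile = webdriver.FirefoxProfile()
profile.set_proxy(proxy.selenium_proxy())
driver = webdriver.Firefox(firefox_profile=profile, executable_path='./geckodriver')

# Start recording a HAR that includes headers and response bodies
proxy.new_har(urle, options={'captureHeaders': True, 'captureContent': True})
driver.get(urle)
time.sleep(10)  # give background requests time to finish before reading the HAR
result = json.dumps(proxy.har, ensure_ascii=False)
print(result)

# Shut down the BrowserMob Proxy server process, then the browser
server.stop()
driver.quit()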
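Once the HAR contains the GraphQL request, pulling out the `query_hash` is plain URL parsing. Here is a minimal sketch, assuming the HAR has the usual `log` / `entries` / `request` / `url` layout shown in the question. The function name `graphql_query_hashes` is hypothetical, and `sample_har` is a hand-made stand-in for `proxy.har` with the `variables` part of the URL shortened.

```python
from urllib.parse import urlparse, parse_qs


def graphql_query_hashes(har, path_fragment="/graphql/query"):
    """Hypothetical helper: collect the distinct query_hash values from
    HAR entries whose request URL path contains `path_fragment`."""
    hashes = []
    for entry in har.get("log", {}).get("entries", []):
        url = entry.get("request", {}).get("url", "")
        parsed = urlparse(url)
        if path_fragment not in parsed.path:
            continue  # skip images, scripts and other unrelated requests
        for value in parse_qs(parsed.query).get("query_hash", []):
            if value not in hashes:
                hashes.append(value)
    return hashes


# Hand-made stand-in for proxy.har, for illustration only
sample_har = {
    "log": {
        "entries": [
            {"request": {"url": "https://scontent-sof1-1.cdninstagram.com/pic.jpg"}},
            {"request": {"url": "https://www.instagram.com/graphql/query/"
                                "?query_hash=33ba35000cb50da46f5b5e889df7d159"
                                "&variables=%7B%22shortcode%22%3A%22Bi9ZURdA6Gn%22%7D"}},
        ]
    }
}

print(graphql_query_hashes(sample_har))
```

In the script above you would call `graphql_query_hashes(proxy.har)` after the wait and before shutting the proxy down. The response body itself, when `captureContent` is enabled, should be under each entry's `response` / `content` keys, but I have not verified that against Instagram.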
Benjamin Loison