
I am trying to access the historical data of this page from the date 01/01/2018 in the scrapy shell.

After some analysis, I figured out that the form data of the request looks like this:

In [124]: form
Out[124]: 
{'action': 'historical_data',
 'curr_id': '44765',
 'end_date': '07/04/2020',
 'header': 'YANG+Historical+Data',
 'interval_sec': 'Daily',
 'smlID': '2520420',
 'sort_col': 'date',
 'sort_ord': 'DESC',
 'st_date': '01/01/2018'}

And the request URL and headers are like this:

In [125]: url
Out[125]: 'https://www.investing.com/instruments/HistoricalDataAjax'

In [126]: head
Out[126]: 
({'name': 'Accept', 'value': 'text/plain, */*; q=0.01'},
 {'name': 'Accept-Encoding', 'value': 'gzip, deflate, br'},
 {'name': 'Accept-Language', 'value': 'en-US,en;q=0.5'},
 {'name': 'Cache-Control', 'value': 'no-cache'},
 {'name': 'Connection', 'value': 'keep-alive'},
 {'name': 'Content-Length', 'value': '172'},
 {'name': 'Content-Type', 'value': 'application/x-www-form-urlencoded'},
 {'name': 'Host', 'value': 'www.investing.com'},
 {'name': 'Origin', 'value': 'https://www.investing.com'},
 {'name': 'Pragma', 'value': 'no-cache'},
 {'name': 'User-Agent',
  'value': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:78.0) Gecko/20100101 Firefox/78.0'},
 {'name': 'X-Requested-With', 'value': 'XMLHttpRequest'})

But when I make the request, it redirects to the home page of the website:

In [127]: fetch(scrapy.FormRequest(url, method='POST', headers=head, formdata=form))
2020-07-04 12:39:39 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (302) to <GET https://www.investing.com/> from <POST https://www.investing.com/instruments/HistoricalDataAjax>
2020-07-04 12:39:40 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.investing.com/> (referer: None)
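One thing worth noting: the `head` shown above is a tuple of `{'name': ..., 'value': ...}` dicts (the format devtools/HAR exports use), not the plain header mapping Scrapy expects. A minimal sketch of the conversion (the two sample headers below are just a subset for illustration):

```python
# Headers copied from the browser devtools come as a sequence of
# {'name': ..., 'value': ...} dicts (HAR export format).
raw_headers = (
    {'name': 'Accept', 'value': 'text/plain, */*; q=0.01'},
    {'name': 'X-Requested-With', 'value': 'XMLHttpRequest'},
)

# scrapy.Request / FormRequest expect a plain {name: value} dict instead.
headers = {h['name']: h['value'] for h in raw_headers}
print(headers)
```

Passing the raw tuple instead of this dict is one plausible reason the request does not carry the intended headers.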

Update:

This header works fine in the developer console and returns the correct response, but in the shell I get a 400 error:

In [13]: header
Out[13]: 
{'Accept': 'text/plain, */*; q=0.01',
 'Accept-Encoding': 'gzip, deflate, br',
 'Accept-Language': 'en-US,en;q=0.5',
 'Cache-Control': 'no-cache',
 'Connection': 'keep-alive',
 'Content-Length': '172',
 'Content-Type': 'application/x-www-form-urlencoded',
 'Host': 'www.investing.com',
 'Origin': 'https://www.investing.com',
 'Pragma': 'no-cache',
 'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:78.0) Gecko/20100101 Firefox/78.0',
 'X-Requested-With': 'XMLHttpRequest'}
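A common cause of a 400 when replaying browser headers is the copied `Content-Length`: the client computes it per request, and a stale hard-coded value can conflict with the body Scrapy actually sends. A hedged sketch (the set of headers to drop is an assumption, not something the site documents) that strips client-computed headers before reuse:

```python
# Headers copied verbatim from the browser (subset for illustration).
browser_headers = {
    'Content-Length': '172',
    'Host': 'www.investing.com',
    'Connection': 'keep-alive',
    'Content-Type': 'application/x-www-form-urlencoded',
    'X-Requested-With': 'XMLHttpRequest',
}

# These are computed by the HTTP client for each request; replaying stale
# values can corrupt or invalidate the request.
AUTO_COMPUTED = {'Content-Length', 'Host', 'Connection'}

clean = {k: v for k, v in browser_headers.items() if k not in AUTO_COMPUTED}
print(clean)
```

Scrapy and requests will then fill in `Content-Length` and `Host` themselves.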

I know I am making a mistake somewhere but can't figure out where.

I searched a lot and tried various approaches like from_request() and Request(url, method='POST', headers=head, body=payload); posting here was my last resort.

2 Answers


In case someone else is looking for the answer, below is the code I used to overcome the problem:

import requests
from bs4 import BeautifulSoup
import pandas as pd

# One ticker symbol per line
with open("symbols.txt", "r") as f:
    ticker_list = [x.strip() for x in f.readlines()]

urlheader = {'Accept': 'text/plain, */*; q=0.01',
 'Accept-Encoding': 'gzip, deflate, br',
 'Accept-Language': 'en-US,en;q=0.5',
 'Cache-Control': 'no-cache',
 'Connection': 'keep-alive',
 'Content-Length': '172',  # requests would normally compute this itself
 'Content-Type': 'application/x-www-form-urlencoded',
 'Host': 'www.investing.com',
 'Origin': 'https://www.investing.com',
 'Pragma': 'no-cache',
 'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:78.0) Gecko/20100101 Firefox/78.0',
 'X-Requested-With': 'XMLHttpRequest'}
data = []
for ticker in ticker_list:
    print(ticker)
    url = "https://www.investing.com/instruments/HistoricalDataAjax"
    payload = {'action': 'historical_data',
             'curr_id': '44765',
             'end_date': '07/04/2020',
             'header': ticker + '+Historical+Data',
             'interval_sec': 'Daily',
             'smlID': '2520420',
             'sort_col': 'date',
             'sort_ord': 'DESC',
             'st_date': '01/01/2018'}
    req = requests.post(url, headers=urlheader, data=payload)
    soup = BeautifulSoup(req.content, "lxml")

    table = soup.find('table', id="curr_table")
    split_rows = table.find_all("tr")
    header_list = split_rows[0:1]       # header row
    split_rows_rev = split_rows[:0:-1]  # data rows, reversed to oldest-first
    for row in header_list:
        key = [column.replace(',', '') for column in row.stripped_strings]
        key.append('Symbol')
    for row in split_rows_rev:
        value = [column.replace(',', '') for column in row.stripped_strings]
        value.append(ticker)
        res = {key[i]: value[i] for i in range(len(key))}
        data.append(res)

df = pd.DataFrame(data)
df.to_csv('investing.csv', index=False)
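The scraper above leaves every cell as a string (commas already stripped). A minimal sketch of typing the columns afterwards; the column names and the date format are assumptions based on the site's table layout and may need adjusting:

```python
import pandas as pd

# Rows as produced by the scraper: every value is still a string.
data = [
    {'Date': 'Jan 02 2018', 'Price': '67.60', 'Symbol': 'YANG'},
    {'Date': 'Jan 03 2018', 'Price': '65.01', 'Symbol': 'YANG'},
]
df = pd.DataFrame(data)

# Parse dates and numeric prices so sorting and arithmetic work correctly.
df['Date'] = pd.to_datetime(df['Date'], format='%b %d %Y')
df['Price'] = pd.to_numeric(df['Price'])
print(df.dtypes)
```

With typed columns, `df.sort_values('Date')` and resampling behave as expected.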

The headers should be formatted like this:

headers = {
    'connection': 'keep-alive',
    'User-Agent': 'Mozilla/5.0 (iPad; CPU OS 11_0 like Mac OS X) AppleWebKit/604.1.34 (KHTML, like Gecko) Version/11.0 Mobile/15A5341f Safari/604.1',
    'X-Agent': 'Juejin/Web',
    'Content-Type': 'application/json',
    'Host': 'web-api.juejin.im',
    'Origin': 'https://juejin.im',
}

This is just an example from another site; rewrite your own headers in this dict format.

Elio
  • I try a website in the scrapy shell before writing the code. – Danyal Mughal Jul 04 '20 at 08:11
  • @DanyalMughal I think it's a problem with your header's format; I get the correct response with Postman. – Elio Jul 04 '20 at 08:30
  • I tried this header in the scrapy shell and got a 403 error. – Danyal Mughal Jul 04 '20 at 08:37
  • @DanyalMughal The one I posted is not for this website; you should rewrite yours in this format. – Elio Jul 04 '20 at 08:44
  • I analysed with Charles and I think the problem may lie in the Content-Type: the request body gets transformed to something like ```----------------------------067532888719910114752634 Content-Disposition: form-data; name="curr_id" 44765 ----------------------------0675328887199101147``` but with Scrapy I tried both `urlencode` and `FormRequest` and didn't achieve that. – Elio Jul 05 '20 at 05:33
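The body Elio describes is a multipart/form-data encoding; an AJAX endpoint expecting `application/x-www-form-urlencoded` will reject it. One way to take the encoding out of the framework's hands is to build the urlencoded body by hand and send it as a plain POST. A hedged sketch (the `scrapy.Request` call shown in the comment is the intended usage, not executed here):

```python
from urllib.parse import urlencode

# Subset of the form fields from the question, for illustration.
form = {'action': 'historical_data', 'curr_id': '44765'}

# Encode the form by hand so the body is guaranteed to be urlencoded,
# then pass it to a plain scrapy.Request instead of FormRequest:
#   scrapy.Request(url, method='POST', body=body,
#                  headers={'Content-Type': 'application/x-www-form-urlencoded'})
body = urlencode(form)
print(body)
```

This bypasses `FormRequest`'s own encoding entirely, so the wire format matches the Content-Type header exactly.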