However, I keep getting this issue in the shell.

 2018-09-13 14:50:36 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
 2018-09-13 14:50:36 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6028
 2018-09-13 14:50:37 [scrapy.core.engine] DEBUG: Crawled (404) <GET http://quotes.toscrape.com/robots.txt> (referer: None)
 2018-09-13 14:50:38 [scrapy.core.engine] DEBUG: Crawled (404) <GET http://localhost:8050/robots.txt> (referer: None)
 2018-09-13 14:51:10 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET http://quotes.toscrape.com/js/ via http://localhost:8050/render.html> (failed 1 times): 504 Gateway Time-out
 2018-09-13 14:51:36 [scrapy.extensions.logstats] INFO: Crawled 2 pages (at 2 pages/min), scraped 0 items (at 0 items/min)
 2018-09-13 14:51:40 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET http://quotes.toscrape.com/js/ via http://localhost:8050/render.html> (failed 2 times): 504 Gateway Time-out
 2018-09-13 14:52:00 [scrapy.downloadermiddlewares.retry] DEBUG: Gave up retrying <GET http://quotes.toscrape.com/js/ via http://localhost:8050/render.html> (failed 3 times): 502 Bad Gateway
 2018-09-13 14:52:00 [scrapy.core.engine] DEBUG: Crawled (502) <GET http://quotes.toscrape.com/js/ via http://localhost:8050/render.html> (referer: None)
 2018-09-13 14:52:00 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <502 http://quotes.toscrape.com/js/>: HTTP status code is not handled or not allowed

Here is my code:

import scrapy
from scrapy_splash import SplashRequest


class MySpider(scrapy.Spider):
    name = "jsscraper"

    start_urls = ["http://quotes.toscrape.com/js/"]

    def start_requests(self):
        # Route every start URL through Splash so the JavaScript gets rendered.
        for url in self.start_urls:
            yield SplashRequest(url=url, callback=self.parse, endpoint='render.html')

    def parse(self, response):
        # Yield one item per quote block found in the rendered page.
        for quote in response.css("div.quote"):
            scraped_info = {
                'authorname': quote.css('small.author::text').extract_first(),
                'quote': quote.css('span.text::text').extract_first(),
            }
            yield scraped_info

I have installed scrapy-splash and added the required settings to settings.py. My Splash server is running on http://localhost:8050/.
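For reference, the settings that the scrapy-splash README asks for look roughly like the sketch below. This is an assumption about what "the required settings" means, not a copy of the settings.py actually used here; the `SPLASH_URL` value is taken from the server address mentioned above.

```python
# settings.py (sketch based on the scrapy-splash README, not the asker's actual file)

# Address of the running Splash instance (assumed from the question).
SPLASH_URL = "http://localhost:8050"

# Middlewares that route requests through Splash and handle its cookies.
DOWNLOADER_MIDDLEWARES = {
    "scrapy_splash.SplashCookiesMiddleware": 723,
    "scrapy_splash.SplashMiddleware": 725,
    "scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware": 810,
}

# Saves space by not storing duplicate Splash arguments in the request queue.
SPIDER_MIDDLEWARES = {
    "scrapy_splash.SplashDeduplicateArgsMiddleware": 100,
}

# Splash-aware duplicate filter and HTTP cache storage.
DUPEFILTER_CLASS = "scrapy_splash.SplashAwareDupeFilter"
HTTPCACHE_STORAGE = "scrapy_splash.SplashAwareFSCacheStorage"
```

If any of these entries is missing, requests may not reach Splash at all, so it is worth comparing them against the real settings.py before looking elsewhere.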

Also, when I try to render any URL on the Splash server, I get another error:

HTTP Error 400 (Bad Request) Type: ScriptError -> LUA_ERROR Error happened while executing Lua script

Lua error: [string "function main(splash, args) ..."]:2: network3

I am using:

  • Splash version: 3.2

  • Lua 5.2

yajant b
  • What command did you use to start your Splash instance? The problem doesn't seem to be in your Python code. Also, rendering a URL in the Splash UI is probably failing because of the Lua script it uses (the UI always uses the execute endpoint) – Lucas Wieloch Sep 13 '18 at 17:17
  • Can you post a FULL log here? – gangabass Sep 14 '18 at 03:47
  • @LucasWieloch I used this command: `sudo docker run -p 8050:8050 -p 5023:5023 scrapinghub/splash`. Also, what do you mean by _execute endpoint_? P.S. I am new to Scrapy and Lua. – yajant b Sep 14 '18 at 06:18
  • @yajantb I meant [execute endpoint](https://splash.readthedocs.io/en/stable/api.html#execute) – Lucas Wieloch Sep 14 '18 at 12:12
  • @LucasWieloch Okay, it is render.html. I tried changing it to render.json too, but I get the same error. – yajant b Sep 14 '18 at 12:26
  • I recommend two things - First, go to http://localhost:8050 and try rendering any page like google.com, and then the page you're trying to scrape. Secondly, try starting your splash with something like `docker run -it -p 8050:8050 scrapinghub/splash --max-timeout 3600 --slots 10` – Lucas Wieloch Sep 14 '18 at 12:41
  • @LucasWieloch I have tried rendering that URL before, but I get the same issue. I will try the second suggestion and let you know. – yajant b Sep 14 '18 at 13:15
  • @yajantb You get the same issue when you try rendering it on splash web UI, or do you get `HTTP Error 400 (Bad Request) Type: ScriptError` (as mentioned in question)? If you can't render it on splash web UI then there is something wrong with your splash instance. You could try reinstalling it or [splash cookiecutter](https://github.com/TeamHG-Memex/aquarium) – Lucas Wieloch Sep 14 '18 at 13:20
  • @LucasWieloch Yes, when rendering again I get the same HTTP Error 404. – yajant b Sep 14 '18 at 13:57
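
The check suggested in the comments (rendering a page through Splash directly, outside Scrapy) can be sketched with the standard library alone. This is only a diagnostic sketch: `build_render_url` and `check_splash` are hypothetical helper names, the `wait` and `timeout` values are arbitrary assumptions, and the base address is the one given in the question. It calls the `render.html` endpoint, so it does not depend on the Lua script that the web UI runs.

```python
import urllib.error
import urllib.parse
import urllib.request

# Splash address from the question; adjust if the container maps another port.
SPLASH_BASE = "http://localhost:8050"


def build_render_url(target_url, wait=0.5, timeout=30, splash_base=SPLASH_BASE):
    """Build a render.html URL asking Splash to render target_url."""
    query = urllib.parse.urlencode(
        {"url": target_url, "wait": wait, "timeout": timeout}
    )
    return f"{splash_base}/render.html?{query}"


def check_splash(target_url):
    """Return (status, snippet): the HTTP status and the first bytes of the body.

    status is None when Splash could not be reached at all.
    """
    try:
        with urllib.request.urlopen(build_render_url(target_url), timeout=60) as resp:
            return resp.status, resp.read(200).decode("utf-8", "replace")
    except urllib.error.HTTPError as exc:
        # Splash answered, but with an error status such as 400, 502 or 504.
        return exc.code, exc.read(200).decode("utf-8", "replace")
    except urllib.error.URLError as exc:
        # Nothing is listening on the port, or the connection failed.
        return None, str(exc.reason)
```

Calling `check_splash("http://quotes.toscrape.com/js/")` and then the same function with some other public page separates the cases: if both fail, the Splash instance itself (or its network access from inside the container) is the likely culprit rather than the spider.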
