Questions tagged [scrapinghub]

Scrapinghub, a web scraping development and services company, supplies cloud-based web crawling platforms.

179 questions
1
vote
0 answers

scrapinghub requests.exceptions.HTTPError: 401 Client Error: Unauthorized for url: https://storage.scrapinghub.com

I am trying to run scrapy_price_monitor in a local environment, but when I issue the command "scrapy crawl spidername", it returns "unauthorized" when trying to send the item to storage.scrapinghub. I have already successfully run "shub login" (added my…
pedrovgp
  • 767
  • 9
  • 23
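A 401 from storage.scrapinghub.com usually means the process could not find an API key. A minimal sketch of one common fix, assuming the price monitor authenticates through the python-scrapinghub client, which reads the SH_APIKEY environment variable (`shub login` only writes `~/.scrapinghub.yml`, which that client does not consult — whether this applies to scrapy_price_monitor specifically is an assumption):

```shell
# Assumed fix: expose the Scrapy Cloud API key where python-scrapinghub
# looks for it.  The key value below is a placeholder.
export SH_APIKEY="<your Scrapy Cloud API key>"
# scrapy crawl spidername   # re-run the spider once the variable is set
```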
1
vote
0 answers

Scrapy: settings, multiple concurrent spiders, and middlewares

I'm used to running spiders one at a time, because we mostly work with scrapy crawl and on scrapinghub, but I know that one can run multiple spiders concurrently, and I have seen that middlewares often have a spider parameter in their…
kenshin
  • 197
  • 11
1
vote
0 answers

Why is the Splash headless browser not able to fetch LinkedIn pages?

I have tried to get the page source of LinkedIn, but I am not able to fetch even one URL; I get the response "Failed loading page". A few samples: https://www.linkedin.com/company/amazon https://www.linkedin.com/company/apple Splash version:…
1
vote
1 answer

How to scrape multiple websites with different data in urls

I'm scraping some data from a webpage where the end of the URL holds the ID of the product. It appears to rewrite the data at every single row, as if it's not appending the data from the next line. I don't know exactly what's going on, if my first…
1
vote
1 answer

mysql.connector.errors.InterfaceError: 2003: Can't connect to MySQL server on '127.0.0.1:3306' on Scrapinghub

I am trying to run my spider on Scrapinghub, but when I run it I get an error: Traceback (most recent call last): File "/usr/local/lib/python3.6/site-packages/twisted/internet/defer.py", line 1418, in _inlineCallbacks result = g.send(result) File…
Biddaris
  • 33
  • 1
  • 4
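The refusal on 127.0.0.1:3306 is expected on Scrapy Cloud: the container has no local MySQL server, so the spider must connect to a publicly reachable database host instead. A hedged sketch that keeps the address out of the code (the environment-variable names and defaults here are assumptions, not part of the original question):

```python
import os

def mysql_params():
    """Build connection parameters for mysql.connector.connect().
    On Scrapy Cloud there is no local MySQL server, so 127.0.0.1 is
    always refused; the host must come from configuration instead."""
    return {
        "host": os.environ.get("MYSQL_HOST", "127.0.0.1"),
        "port": int(os.environ.get("MYSQL_PORT", "3306")),
        "user": os.environ.get("MYSQL_USER", "root"),
        "password": os.environ.get("MYSQL_PASSWORD", ""),
        "database": os.environ.get("MYSQL_DB", "scraping"),
    }

# In the pipeline, something like:
#   import mysql.connector
#   conn = mysql.connector.connect(**mysql_params())
```

On Scrapy Cloud the variables would be set through the project's settings UI rather than a local shell.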
1
vote
1 answer

"'str' object has no attribute 'get'" when using Google Cloud Storage with ScrapingHub

I'm trying to get Google Cloud Storage working with a Scrapy Cloud + Crawlera project so that I can save text files I'm trying to download. I'm encountering an error when I run my script that seems to have to do with my Google permissions not…
1
vote
1 answer

Scrapy throws exception "raise _DefGen_Return(val) twisted.internet.defer._DefGen_Return: "

When I run the code locally (Windows 10), everything works fine. I have checked other answers here and other resources, but failed to figure out any solution. After deploying to ScrapingHub I'm getting this error message: [scrapy.core.scraper] Spider…
Billy Jhon
  • 1,035
  • 15
  • 30
1
vote
2 answers

How to dynamically upload data from Scrapinghub to Wordpress?

I am running periodic spiders in Scrapy Cloud and exporting the results to an AWS S3 bucket. I need to dynamically update my WordPress tables with these results; I am currently using the TablePress plugin, which has an "Import tables" option, but it…
Jorge Garcia
  • 117
  • 9
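One hedged way to bridge the gap above: publish the scraped rows as an HTML table at a stable URL (the S3 bucket works for this), since TablePress's "Import tables" dialog accepts HTML and can import from a URL. A minimal renderer (the field names in the test data are invented for illustration):

```python
def items_to_html_table(items):
    """Render scraped items (a list of dicts sharing the same keys) as a
    minimal HTML table string.  Uploading this alongside the normal feed
    export gives TablePress something it can re-import after each run."""
    headers = list(items[0])
    head = "".join(f"<th>{h}</th>" for h in headers)
    rows = "".join(
        "<tr>" + "".join(f"<td>{item[h]}</td>" for h in headers) + "</tr>"
        for item in items
    )
    return f"<table><tr>{head}</tr>{rows}</table>"
```

Pointing TablePress at the uploaded file keeps the WordPress table in sync after each periodic job; automating the re-import itself would still need WordPress-side scheduling.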
1
vote
1 answer

Connection was refused by other side: 111: Connection refused

I have a spider for LinkedIn. It works fine on my local machine, but when I deploy it on Scrapinghub I get the error: Error downloading : Connection was refused by other side: 111: Connection refused. The complete log of…
Alpha Romeo
  • 84
  • 10
1
vote
1 answer

Scrapy request durations gradually higher when scraping lots of different domains on Scrapinghub

I'm using Scrapy, on Scrapinghub, to scrape a few thousand websites. When scraping a single website, request durations stay pretty short (< 100 ms). But I also have a spider that is responsible for 'validating' around 10k URLs (I'm testing a…
romain-lavoix
  • 403
  • 2
  • 6
  • 20
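Growing request durations across many domains often trace back to per-domain scheduler slots, repeated DNS lookups, and a few slow sites monopolizing concurrency. A sketch of the standard Scrapy settings worth tuning for a broad crawl (the values are illustrative, not recommendations):

```python
# settings.py — illustrative values for a broad, many-domain crawl
CONCURRENT_REQUESTS = 64            # total in-flight requests
CONCURRENT_REQUESTS_PER_DOMAIN = 8  # keep one slow site from hogging slots
DOWNLOAD_TIMEOUT = 15               # fail slow responders quickly (seconds)
DNS_TIMEOUT = 10                    # give up sooner on unresolvable hosts
DNSCACHE_ENABLED = True             # avoid repeated lookups per domain
AUTOTHROTTLE_ENABLED = False        # autothrottle slows all slots on latency spikes
```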
1
vote
1 answer

Keyword async error running shub command

I have my spiders ready, and now I want to deploy them to Scrapinghub. I've successfully installed shub by running pip3 install shub. I'm using Python 3.7, but when I run shub, I get a syntax error. I can see that this issue should be fixed in the latest…
jonask
  • 679
  • 2
  • 6
  • 21
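The syntax error above stems from `async` being promoted to a reserved keyword in Python 3.7; code that used it as an ordinary identifier (as older shub releases did) fails with a SyntaxError the moment it is imported. A small demonstration of the language change itself:

```python
# 'async' was a legal identifier through Python 3.6 but is a reserved
# keyword since 3.7, so this function definition no longer compiles:
src = "def f(async=None): pass"
try:
    compile(src, "<example>", "exec")
    ok = True          # would happen on Python <= 3.6
except SyntaxError:
    ok = False         # happens on Python 3.7+
print(ok)
```

Since the question itself notes the issue is fixed in the latest release, the usual remedy is simply `pip3 install --upgrade shub`.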
1
vote
0 answers

Scrapinghub/Splash website page fetching time increasing exponentially with parallel threads

In my trial, I hit the Splash instance with 50 parallel threads; each thread gets the page source of a URL. My Splash instance's default slots value is 50. Here, website fetching time increases exponentially with the number of parallel threads. I…
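With 50 client threads against 50 render slots, every request competes for the same browser pool, so queueing inside Splash (not the target site) can dominate the measured latency. A hedged launch sketch using Splash's documented server options (the slot count and timeout values are illustrative):

```shell
# Start Splash with an explicit slot count and a longer render timeout.
# Raising --slots trades memory for parallelism; past a point it is
# better to run several Splash containers behind a load balancer,
# which is what the Aquarium setup automates.
docker run -p 8050:8050 scrapinghub/splash --slots 50 --max-timeout 90
```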
1
vote
0 answers

Scrapinghub/Splash - Aquarium is not working with docker-compose

We are trying to use Aquarium to set up Scrapinghub/Splash. During installation, when I use "docker-compose up" to start Splash, it throws this exception: Traceback (most recent call last): File "/usr/local/bin/docker-compose", line 11,…
1
vote
1 answer

Scrapy and Splash: right settings, but still getting a connection error

Under my settings.py SPLASH_URL = 'http://127.0.0.1:8050' DOWNLOADER_MIDDLEWARES = { 'scrapy_splash.SplashCookiesMiddleware': 723, 'scrapy_splash.SplashMiddleware': 725, 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware':…
user1441797
  • 134
  • 1
  • 1
  • 10
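For comparison with the truncated settings above, the scrapy-splash README's reference configuration looks like this; note that SPLASH_URL must point at a running Splash instance, and a refused connection usually means nothing is listening at that address (for example, the Splash container is not running or is bound to a different host/port):

```python
# settings.py — reference configuration from the scrapy-splash README
SPLASH_URL = 'http://127.0.0.1:8050'  # must point at a running Splash instance

DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}
SPIDER_MIDDLEWARES = {
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}
DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'
HTTPCACHE_STORAGE = 'scrapy_splash.SplashAwareFSCacheStorage'
```

A quick sanity check is to open http://127.0.0.1:8050 in a browser: if the Splash UI does not load there, the problem is the Splash service, not the Scrapy settings.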