Questions tagged [scrapinghub]
Scrapinghub, a web scraping development and services company, supplies cloud-based web crawling platforms.
179 questions
1
vote
2 answers
How to use Crawlera with selenium (Python, Chrome, Windows) without Polipo
So basically I am trying to use the Crawlera proxy from Scrapinghub with Selenium Chrome on Windows, using Python.
I checked the documentation and they suggested using Polipo like this:
1) adding the following lines to /etc/polipo/config
parentProxy…

Emilz
- 73
- 1
- 8
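A minimal sketch of one Polipo-free approach (not the official Crawlera setup): pointing Chrome at the Crawlera endpoint directly with the --proxy-server switch. Chrome does not accept proxy credentials on the command line, so this assumes IP-based authentication is enabled on the Crawlera account; otherwise a proxy-auth extension or selenium-wire would be needed.

# Sketch: Selenium Chrome routed through the Crawlera endpoint from the
# question's sample (proxy.crawlera.com:8010), assuming IP whitelisting
# handles authentication.
from selenium import webdriver

options = webdriver.ChromeOptions()
options.add_argument("--proxy-server=http://proxy.crawlera.com:8010")

driver = webdriver.Chrome(options=options)
driver.get("http://httpbin.org/ip")
print(driver.page_source)
driver.quit()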
1
vote
1 answer
Scrapy Prevent Visiting Same URL Across Schedule
I am planning on deploying a Scrapy spider to ScrapingHub and using the schedule feature to run the spider on a daily basis. I know that, by default, Scrapy does not visit the same URLs. However, I was wondering if this duplicate URL avoidance is…

Marcus Christiansen
- 187
- 2
- 12
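Scrapy's built-in duplicate filter only lasts for a single run, so it does not help across scheduled daily jobs. A sketch of one common approach, the scrapy-deltafetch spider middleware (the same middleware mentioned in a later question on this page), which records requests that have already produced items:

# settings.py sketch: enable scrapy-deltafetch so rescheduled jobs skip
# requests that already yielded items in earlier runs.
SPIDER_MIDDLEWARES = {
    "scrapy_deltafetch.DeltaFetch": 100,
}
DELTAFETCH_ENABLED = True

On Scrapy Cloud the .scrapy directory that DeltaFetch writes to is not kept between jobs by default, so some form of persistence (for example the DotScrapy Persistence add-on) would also be needed.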
1
vote
0 answers
Scraping Hub Periodic Script / IOError No such file or directory
I am trying to run a periodic script and connect it with a json file within my project. I tried this (https://support.scrapinghub.com/support/solutions/articles/22000200416-deploying-non-code-files) but this is not working for me, structure imported…

nicolasdavid
- 2,821
- 4
- 18
- 22
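A frequent cause of an IOError here is opening the JSON file by filesystem path, which breaks once the project is deployed to Scrapy Cloud as a Python egg. A hedged sketch of reading the file as package data instead, assuming it is declared under package_data in setup.py and lives inside the project package (myproject and resources/data.json are hypothetical names):

# Sketch: load a packaged JSON resource without relying on a filesystem path.
import json
import pkgutil

raw = pkgutil.get_data("myproject", "resources/data.json")
data = json.loads(raw.decode("utf-8"))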
1
vote
0 answers
Modules folder in Scrapinghub
I'm currently using Scrapinghub's Scrapy Cloud to host my 12 spiders (and 12 different projects).
I'd like to have one folder with functions that are used by all 12 spiders, but I'm not sure what the best way is to implement this without having 1 functions…

Axel Eriksson
- 105
- 1
- 11
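Within a single project the simplest arrangement is a plain Python package next to the spiders; for 12 separate Scrapy Cloud projects the shared code generally has to be packaged and listed in each project's requirements instead. A sketch of the in-project case, with hypothetical names:

# Hypothetical layout:
#   myproject/
#       common/
#           __init__.py
#           parsing.py      # shared helpers, e.g. def clean_price(text): ...
#       spiders/
#           spider_one.py
#
# spiders/spider_one.py
import scrapy

from myproject.common.parsing import clean_price


class SpiderOne(scrapy.Spider):
    name = "spider_one"
    start_urls = ["https://example.com"]

    def parse(self, response):
        yield {"price": clean_price(response.css(".price::text").get())}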
1
vote
2 answers
Crawlera: 407 "Bad Auth" error message
Using Crawlera's sample code for a GET request with a proxy.
import requests
url = "http://httpbin.org/ip"
proxy_host = "proxy.crawlera.com"
proxy_port = "8010"
proxy_auth = ":" # Make sure to include ':' at the end
proxies = {
…

Joseph D.
- 11,804
- 3
- 34
- 67
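A 407 "Bad Auth" from Crawlera usually means the API key was not supplied as the proxy username; in the sample above, proxy_auth is just ":" with nothing before the colon. A sketch of the same requests snippet with a placeholder key filled in:

# Sketch: Crawlera API key as the proxy username, empty password
# (<CRAWLERA_APIKEY> is a placeholder).
import requests

url = "http://httpbin.org/ip"
proxy_host = "proxy.crawlera.com"
proxy_port = "8010"
proxy_auth = "<CRAWLERA_APIKEY>:"

proxies = {
    "http": "http://{}@{}:{}/".format(proxy_auth, proxy_host, proxy_port),
    "https": "http://{}@{}:{}/".format(proxy_auth, proxy_host, proxy_port),
}

response = requests.get(url, proxies=proxies)
print(response.text)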
1
vote
1 answer
Deploy failed because multiple spiders with Scrapinghub
I created a project with Scrapy and save the data to my MongoDB. It works.
Here is my code:
# -*- coding: utf-8 -*-
import scrapy
from scrapy import Request
import time
# scrapy api imports
from scrapy.crawler import CrawlerProcess
from…

Morton
- 5,380
- 18
- 63
- 118
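One common reason deploys fail when a project also contains a local runner script is that module-level CrawlerProcess code gets executed while shub is listing the spiders. A hedged sketch of guarding such a script so it only runs when executed directly (spider names are hypothetical):

# run_local.py sketch: keep the CrawlerProcess runner out of import-time code.
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings


def run_all():
    process = CrawlerProcess(get_project_settings())
    process.crawl("spider_one")
    process.crawl("spider_two")
    process.start()


if __name__ == "__main__":
    run_all()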
1
vote
1 answer
How to let Scrapy access Tor after deploying to Scrapinghub
I had configured the spider to access Tor by setting up Privoxy, but this only works when I use it on localhost, since the setting I configured points to 127.0.0.1:port. But when I deploy to Scrapinghub, the server side does not set up Tor and Privoxy as…

Terence Goh
- 13
- 1
- 4
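For reference, the localhost arrangement described here typically routes each request through Privoxy (which forwards to Tor) via Scrapy's standard per-request proxy hook; it can only work where that proxy is actually reachable, which is not the case on Scrapy Cloud unless an externally reachable Tor proxy endpoint is used instead. A minimal sketch of the local form:

# Sketch: per-request proxy pointing at a local Privoxy instance (default
# port 8118). This works only on the machine running Tor + Privoxy.
import scrapy


class TorSpider(scrapy.Spider):
    name = "tor_example"  # hypothetical spider

    def start_requests(self):
        yield scrapy.Request(
            "https://check.torproject.org/",
            meta={"proxy": "http://127.0.0.1:8118"},
        )

    def parse(self, response):
        self.logger.info("Status: %s", response.status)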
1
vote
1 answer
Ignoring requests while scraping two pages
I am now scraping this website on a daily basis, and am using DeltaFetch to ignore pages which have already been visited (a lot of them).
The issue I am facing is that for this website, I need to first scrape page A, and then scrape page B to…

Abel Riboulot
- 158
- 1
- 8
1
vote
1 answer
Scrapy, Scrapinghub and Google Cloud Storage: Keyerror 'gs' while running the spider on scrapinghub
I'm working on a scrapy project using Python 3 and the spiders are deployed to scrapinghub. I'm also using Google Cloud Storage to store the scraped files as mentioned in the official doc here.
The spiders are running absolutely fine when I'm…

Sagar Singh Verma
- 55
- 11
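A KeyError for 'gs' usually means the Scrapy version resolving the storage scheme predates Google Cloud Storage support (added to the media pipelines around Scrapy 1.5), so the Scrapy Cloud stack and requirements need a recent Scrapy plus the google-cloud-storage package. A settings sketch with placeholder bucket and project id:

# settings.py sketch (placeholder bucket and GCP project id).
FILES_STORE = "gs://my-bucket/scraped-files/"
GCS_PROJECT_ID = "my-gcp-project"

ITEM_PIPELINES = {
    "scrapy.pipelines.files.FilesPipeline": 1,
}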
1
vote
2 answers
Set variable on shub deploy project
I'm trying to set up Scrapy settings to work with test and production environments locally and also on Scrapinghub.
I would like to know if there is any way to set this variable (for example as the following) on shub deploy:
And then at…

Alberto
- 1,423
- 18
- 32
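As far as I know shub does not take arbitrary settings as deploy flags, so one common workaround is to branch in settings.py on an environment variable (or a Scrapy Cloud project setting) instead. A sketch with made-up names:

# settings.py sketch: SCRAPY_ENV and the MONGO_URI values are hypothetical.
import os

SCRAPY_ENV = os.environ.get("SCRAPY_ENV", "production")

if SCRAPY_ENV == "test":
    MONGO_URI = "mongodb://localhost:27017/test"
else:
    MONGO_URI = "mongodb://db.example.com:27017/production"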
1
vote
1 answer
Scrapinghub shub deploy error - Error: Deploy failed (400): project: non_field_errors
When I try to shub deploy to the cloud, I get the following error.
Error: Deploy failed (400):
project: non_field_errors
My current setup is as follows.
def __init__(self, startUrls, *args, **kwargs):
    self.keywords =…

Billy Jhon
- 1,035
- 15
- 30
1
vote
0 answers
OSError: [Errno 1] Operation not permitted: '/System/Library/Frameworks/Python.framework/Versions/2.7/man'
I'm trying to install shub, the Scrapinghub command-line tool, on OS X 10.11.6 (El Capitan) via pip. The installation script downloads the required modules and at some point returns the following error:
OSError: [Errno 1] Operation not permitted:…

Rell
- 11
- 1
- 3
1
vote
1 answer
Serialize decimals in scrapinghub
I'm following the documentation about serializers in this link, and I'm not sure if there's a lack of documentation regarding decimal serializers. I defined an Item with a Scrapy field like this:
prize = scrapy.Field(serializer=Decimal,…

delpo
- 210
- 2
- 18
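Because decimal.Decimal is not JSON-serializable, a common workaround is to have the field serializer turn the value into a string, which the feed exporters apply on export. A sketch:

# Sketch: serialize Decimal prices to strings so JSON/CSV export works.
from decimal import Decimal

import scrapy


def serialize_price(value):
    return str(value)  # keep the exact textual representation


class ProductItem(scrapy.Item):
    prize = scrapy.Field(serializer=serialize_price)


item = ProductItem(prize=Decimal("19.99"))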
1
vote
1 answer
Sequential Order for Item Output | Scrapy
I am using the ScrapingHub API, and am using shub to deploy my project. However, the item output is ordered as shown:
Unfortunately, I need it in the following order --> Title, Publish Date, Description, Link. How can I get the output to be in exactly…

Friezan
- 41
- 2
- 7
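When the items are exported through a feed exporter, the FEED_EXPORT_FIELDS setting fixes the field order. A sketch using the order asked for above (the actual item may use different field names):

# settings.py sketch: export fields in a fixed order.
FEED_EXPORT_FIELDS = ["title", "publish_date", "description", "link"]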
1
vote
1 answer
Scrapinghub can't connect?
I am trying to simply deploy a Scrapy spider to ScrapingHub following the rules they provide. For some reason, it is searching for a Python 3.6 directory specifically, when it should be able to search for any 3.x Python directory. My spider is written on…

Friezan
- 41
- 2
- 7