Questions tagged [scrapinghub]
Scrapinghub, a web scraping development and services company, supplies cloud-based web crawling platforms.
179 questions
1
vote
2 answers
How to use Crawlera with selenium (Python, Chrome, Windows) without Polipo
So basically I am trying to use the Crawlera proxy from Scrapinghub with Selenium Chrome on Windows, using Python.
I checked the documentation and they suggested using Polipo like this:
1) adding the following lines to /etc/polipo/config
parentProxy…

Emilz
- 73
- 1
- 8
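A minimal sketch of one Polipo-free approach (not the official Crawlera setup): pointing Chrome at the Crawlera endpoint directly with the --proxy-server switch. Chrome does not accept proxy credentials on the command line, so this assumes IP-based authentication is enabled on the Crawlera account; otherwise a proxy-auth extension or selenium-wire would be needed.

# Sketch: Selenium Chrome routed through the Crawlera endpoint from the
# question's sample (proxy.crawlera.com:8010), assuming IP whitelisting
# handles authentication.
from selenium import webdriver

options = webdriver.ChromeOptions()
options.add_argument("--proxy-server=http://proxy.crawlera.com:8010")

driver = webdriver.Chrome(options=options)
driver.get("http://httpbin.org/ip")
print(driver.page_source)
driver.quit()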
1
vote
1 answer
Scrapy Prevent Visiting Same URL Across Schedule
I am planning on deploying a Scrapy spider to ScrapingHub and using the schedule feature to run the spider on a daily basis. I know that, by default, Scrapy does not visit the same URLs. However, I was wondering if this duplicate URL avoidance is…

Marcus Christiansen
- 187
- 2
- 12
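Scrapy's built-in duplicate filter only lasts for a single run, so it does not help across scheduled daily jobs. A sketch of one common approach, the scrapy-deltafetch spider middleware (the same middleware mentioned in a later question on this page), which records requests that have already produced items:

# settings.py sketch: enable scrapy-deltafetch so rescheduled jobs skip
# requests that already yielded items in earlier runs.
SPIDER_MIDDLEWARES = {
    "scrapy_deltafetch.DeltaFetch": 100,
}
DELTAFETCH_ENABLED = True

On Scrapy Cloud the .scrapy directory that DeltaFetch writes to is not kept between jobs by default, so some form of persistence (for example the DotScrapy Persistence add-on) would also be needed.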
1
vote
0 answers
Scraping Hub Periodic Script / IOError No such file or directory
I am trying to run a periodic script and connect it with a json file within my project. I tried this (https://support.scrapinghub.com/support/solutions/articles/22000200416-deploying-non-code-files) but this is not working for me, structure imported…

nicolasdavid
- 2,821
- 4
- 18
- 22
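A frequent cause of an IOError here is opening the JSON file by filesystem path, which breaks once the project is deployed to Scrapy Cloud as a Python egg. A hedged sketch of reading the file as package data instead, assuming it is declared under package_data in setup.py and lives inside the project package (myproject and resources/data.json are hypothetical names):

# Sketch: load a packaged JSON resource without relying on a filesystem path.
import json
import pkgutil

raw = pkgutil.get_data("myproject", "resources/data.json")
data = json.loads(raw.decode("utf-8"))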
1
vote
0 answers
Modules folder in Scrapinghub
I'm currently using Scrapinghub's Scrapy Cloud to host my 12 spiders (and 12 different projects).
I'd like to have one folder with functions that are used by all 12 spiders, but I'm not sure what the best way is to implement this without having 1 functions…

Axel Eriksson
- 105
- 1
- 11
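Within a single project the simplest arrangement is a plain Python package next to the spiders; for 12 separate Scrapy Cloud projects the shared code generally has to be packaged and listed in each project's requirements instead. A sketch of the in-project case, with hypothetical names:

# Hypothetical layout:
#   myproject/
#       common/
#           __init__.py
#           parsing.py      # shared helpers, e.g. def clean_price(text): ...
#       spiders/
#           spider_one.py
#
# spiders/spider_one.py
import scrapy

from myproject.common.parsing import clean_price


class SpiderOne(scrapy.Spider):
    name = "spider_one"
    start_urls = ["https://example.com"]

    def parse(self, response):
        yield {"price": clean_price(response.css(".price::text").get())}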
1
vote
2 answers
Crawlera: 407 "Bad Auth" error message
Using Crawlera's sample code for a GET request with a proxy.
import requests
url = "http://httpbin.org/ip"
proxy_host = "proxy.crawlera.com"
proxy_port = "8010"
proxy_auth = ":" # Make sure to include ':' at the end
proxies = {
…

Joseph D.
- 11,804
- 3
- 34
- 67
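A 407 "Bad Auth" from Crawlera usually means the API key was not supplied as the proxy username; in the sample above, proxy_auth is just ":" with nothing before the colon. A sketch of the same requests snippet with a placeholder key filled in:

# Sketch: Crawlera API key as the proxy username, empty password
# (<CRAWLERA_APIKEY> is a placeholder).
import requests

url = "http://httpbin.org/ip"
proxy_host = "proxy.crawlera.com"
proxy_port = "8010"
proxy_auth = "<CRAWLERA_APIKEY>:"

proxies = {
    "http": "http://{}@{}:{}/".format(proxy_auth, proxy_host, proxy_port),
    "https": "http://{}@{}:{}/".format(proxy_auth, proxy_host, proxy_port),
}

response = requests.get(url, proxies=proxies)
print(response.text)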
1
vote
1 answer
Deploy failed because multiple spiders with Scrapinghub
I created a project with Scrapy and save the data to my MongoDB. It works.
Here is my code:
# -*- coding: utf-8 -*-
import scrapy
from scrapy import Request
import time
# scrapy api imports
from scrapy.crawler import CrawlerProcess
from…

Morton
- 5,380
- 18
- 63
- 118
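One common reason deploys fail when a project also contains a local runner script is that module-level CrawlerProcess code gets executed while shub is listing the spiders. A hedged sketch of guarding such a script so it only runs when executed directly (spider names are hypothetical):

# run_local.py sketch: keep the CrawlerProcess runner out of import-time code.
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings


def run_all():
    process = CrawlerProcess(get_project_settings())
    process.crawl("spider_one")
    process.crawl("spider_two")
    process.start()


if __name__ == "__main__":
    run_all()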
1
vote
1 answer
How to let Scrapy access Tor after deploying to Scrapinghub
I had configured the spider to access Tor by setting up Privoxy, but this only works when I use it on localhost, since the setting I configured points to 127.0.0.1:port. But when I deploy to Scrapinghub, the server side does not set up Tor and Privoxy as…

Terence Goh
- 13
- 1
- 4
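For reference, the localhost arrangement described here typically routes each request through Privoxy (which forwards to Tor) via Scrapy's standard per-request proxy hook; it can only work where that proxy is actually reachable, which is not the case on Scrapy Cloud unless an externally reachable Tor proxy endpoint is used instead. A minimal sketch of the local form:

# Sketch: per-request proxy pointing at a local Privoxy instance (default
# port 8118). This works only on the machine running Tor + Privoxy.
import scrapy


class TorSpider(scrapy.Spider):
    name = "tor_example"  # hypothetical spider

    def start_requests(self):
        yield scrapy.Request(
            "https://check.torproject.org/",
            meta={"proxy": "http://127.0.0.1:8118"},
        )

    def parse(self, response):
        self.logger.info("Status: %s", response.status)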
1
vote
1 answer
Ignoring requests while scraping two pages
I am now scraping this website on a daily basis, and am using DeltaFetch to ignore pages which have already been visited (a lot of them).
The issue I am facing is that for this website, I need to first scrape page A, and then scrape page B to…

Abel Riboulot
- 158
- 1
- 8
1
vote
1 answer
Scrapy, Scrapinghub and Google Cloud Storage: Keyerror 'gs' while running the spider on scrapinghub
I'm working on a scrapy project using Python 3 and the spiders are deployed to scrapinghub. I'm also using Google Cloud Storage to store the scraped files as mentioned in the official doc here.
The spiders are running absolutely fine when I'm…

Sagar Singh Verma
- 55
- 11
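A KeyError for 'gs' usually means the Scrapy version resolving the storage scheme predates Google Cloud Storage support (added to the media pipelines around Scrapy 1.5), so the Scrapy Cloud stack and requirements need a recent Scrapy plus the google-cloud-storage package. A settings sketch with placeholder bucket and project id:

# settings.py sketch (placeholder bucket and GCP project id).
FILES_STORE = "gs://my-bucket/scraped-files/"
GCS_PROJECT_ID = "my-gcp-project"

ITEM_PIPELINES = {
    "scrapy.pipelines.files.FilesPipeline": 1,
}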
1
vote
2 answers
Set variable on shub deploy project
I'm trying to set up Scrapy settings to work with test and production environments locally and also on Scrapinghub.
I would like to know if there is any way to set this variable (for example as the following) on shub deploy:
And then at…

Alberto
- 1,423
- 18
- 32
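As far as I know shub does not take arbitrary settings as deploy flags, so one common workaround is to branch in settings.py on an environment variable (or a Scrapy Cloud project setting) instead. A sketch with made-up names:

# settings.py sketch: SCRAPY_ENV and the MONGO_URI values are hypothetical.
import os

SCRAPY_ENV = os.environ.get("SCRAPY_ENV", "production")

if SCRAPY_ENV == "test":
    MONGO_URI = "mongodb://localhost:27017/test"
else:
    MONGO_URI = "mongodb://db.example.com:27017/production"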
1
vote
1 answer
Scrapinghub shub deploy error - Error: Deploy failed (400): project: non_field_errors
When I try to shub deploy to the cloud, I get the following error.
Error: Deploy failed (400):
project: non_field_errors
My current setup is as follows.
def __init__(self, startUrls, *args, **kwargs):
    self.keywords =…

Billy Jhon
- 1,035
- 15
- 30
1
vote
0 answers
OSError: [Errno 1] Operation not permitted: '/System/Library/Frameworks/Python.framework/Versions/2.7/man'
I'm trying to install shub, the Scrapinghub command-line tool, on OS X 10.11.6 (El Capitan) via pip. The installation script downloads the required modules and at some point returns the following error:
OSError: [Errno 1] Operation not permitted:…

Rell
- 11
- 1
- 3
1
vote
1 answer
Serialize decimals in scrapinghub
I'm following the documentation about serializers in this link, and I'm not sure if there's a lack of documentation regarding decimal serializers. I defined an Item with a Scrapy field like this:
prize = scrapy.Field(serializer=Decimal,…

delpo
- 210
- 2
- 18
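Because decimal.Decimal is not JSON-serializable, a common workaround is to have the field serializer turn the value into a string, which the feed exporters apply on export. A sketch:

# Sketch: serialize Decimal prices to strings so JSON/CSV export works.
from decimal import Decimal

import scrapy


def serialize_price(value):
    return str(value)  # keep the exact textual representation


class ProductItem(scrapy.Item):
    prize = scrapy.Field(serializer=serialize_price)


item = ProductItem(prize=Decimal("19.99"))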
1
vote
1 answer
Sequential Order for Item Output | Scrapy
I am using the ScrapingHub API, and am using shub to deploy my project. However, the item output is ordered as shown:
Unfortunately, I need it in the following order --> Title, Publish Date, Description, Link. How can I get the output to be in exactly…

Friezan
- 41
- 2
- 7
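When the items are exported through a feed exporter, the FEED_EXPORT_FIELDS setting fixes the field order. A sketch using the order asked for above (the actual item may use different field names):

# settings.py sketch: export fields in a fixed order.
FEED_EXPORT_FIELDS = ["title", "publish_date", "description", "link"]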
1
vote
1 answer
Scrapinghub can't connect?
I am trying to simply deploy a Scrapy spider to ScrapingHub following the rules they provide. For some reason, it is searching for a Python 3.6 directory specifically, when it should be able to search for any 3.x Python directory. My spider is written on…

Friezan
- 41
- 2
- 7