Questions tagged [scrapinghub]

Scrapinghub is a web scraping development and services company that supplies cloud-based web crawling platforms.

179 questions
0 votes, 0 answers

What is the correct way to write a file on Scrapinghub?

I use Python with Scrapy and Scrapinghub. In my spider I need to read and write a file: data_directory = 'tmp' csv_magasin = data_directory+"/"+current_script+"_"+current_date+"-shop_url.csv" if not os.path.exists(data_directory): …
parik
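The path construction in the excerpt above can be sketched with the standard library alone. This is a minimal sketch, not an answer to whether files persist on Scrapinghub: the function names `build_csv_path` and `write_rows` are hypothetical, and the script name and date are passed in as plain strings because the excerpt does not show how `current_script` and `current_date` are produced.

```python
import csv
import os


def build_csv_path(data_directory, script_name, run_date):
    """Return <data_directory>/<script_name>_<run_date>-shop_url.csv."""
    # exist_ok=True makes the call safe when the directory already exists,
    # so no separate os.path.exists() check is needed.
    os.makedirs(data_directory, exist_ok=True)
    filename = "{}_{}-shop_url.csv".format(script_name, run_date)
    # os.path.join avoids hand-built "/" separators.
    return os.path.join(data_directory, filename)


def write_rows(path, rows):
    """Write an iterable of rows to a CSV file; return how many were written."""
    count = 0
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        for row in rows:
            writer.writerow(row)
            count += 1
    return count
```

Calling `build_csv_path("tmp", "my_spider", "2018-01-01")` would create `tmp` if needed and return the path of the CSV file inside it.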
0 votes, 0 answers

Scrapinghub Crawled 0 pages (at 0 pages/min)

I have developed a simple Scrapy project to crawl a website. The crawler works fine on my local machine, but when I try to deploy it to Scrapy Cloud, provided by scrapinghub.com, the spider shows 0 pages crawled and after 180 sec (default timeout) it…
shubham003
0 votes, 1 answer

Scrapy: how to save state between spider runs (via Scrapinghub)?

I have a spider that will run on a schedule. The spider's input is based on a date range: from the date of the last scrape to today's date. So the question is how to save the date of the last scrape within the Scrapy project. There is an option to get data from scrapy settings…
Billy Jhon
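The idea of remembering the last scrape date can be sketched as a small load/save pair. This is a sketch under stated assumptions: it stores the date in a local JSON file, the names `load_last_run`, `save_last_run` and the `last_scrape` key are hypothetical, and nothing here establishes that a local file survives between runs on a hosted platform. On such a platform the same two functions could be backed by a different store.

```python
import json
import os
from datetime import date


def load_last_run(state_path, default):
    """Return the stored last-scrape date, or `default` on the first run."""
    if not os.path.exists(state_path):
        return default
    with open(state_path, encoding="utf-8") as handle:
        return date.fromisoformat(json.load(handle)["last_scrape"])


def save_last_run(state_path, run_date):
    """Persist the date of this run so the next run can start from it."""
    with open(state_path, "w", encoding="utf-8") as handle:
        json.dump({"last_scrape": run_date.isoformat()}, handle)
```

A run would call `load_last_run` at start-up to get the lower bound of the date range, scrape up to `date.today()`, and call `save_last_run` only after the crawl finishes cleanly.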
0 votes, 0 answers

Error caught on signal handler: TypeError: to_bytes must receive a unicode got instance

I get this strange error when I run my code in Scrapy Cloud, and I am not sure how to debug it. There is no reference to a line in the spider code. I assume it relates to saving an item and is something general, since no URL is indicated. Moreover the spider runs ok and…
Billy Jhon
0 votes, 1 answer

Dependency error while trying to run project on Scrapy Cloud

I created a project with Scrapy and use pymongo to save my data to MongoDB. I have checked that my pymongo version is 3.5.1. When I deploy my project to Scrapinghub and run it, it shows this error on Scrapinghub: exceptions.ImportError: No module named pymongo I…
Morton
0 votes, 2 answers

Can't install MySQLdb-python==1.2.5 Scrapinghub (Scrapy) Python 2.7

I read some threads about connecting MySQL with a script deployed to Scrapinghub. They recommend changing the *.yml file and adding a requirements.txt. This solution worked a few days ago; now it doesn't. Here is the error from shub deploy. Collecting…
Billy Jhon
0 votes, 1 answer

Update start URLs of a Scrapinghub-hosted Scrapy project via API call

My Scrapy spider is hosted at Scrapinghub. It is managed via the run spider API call. The only thing that changes in the spider from call to call is the list of start URLs. The list may vary from 100 URLs to a couple of thousand. What is the best way to update…
Billy Jhon
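Scrapy spider arguments arrive as strings, so one possible approach (an assumption for illustration, not the answer given on the question) is to pass the URL list as a single delimited string and split it inside the spider. The helper name `parse_start_urls` and the choice of comma and newline as delimiters are hypothetical.

```python
def parse_start_urls(raw):
    """Split a comma- or newline-separated string into a clean list of URLs."""
    if not raw:
        return []
    # Treat newlines like commas so both delimiter styles are accepted.
    parts = raw.replace("\n", ",").split(",")
    # Drop surrounding whitespace and skip empty fragments.
    return [part.strip() for part in parts if part.strip()]
```

Inside a spider, the result could be assigned to `start_urls` in `__init__`. For lists in the thousands, a single argument may become unwieldy, and passing a reference to a file or other store may be a better fit.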
0 votes, 1 answer

Text substitution {} does not work on Scrapinghub

I create a URL with {} format to change the URL on the fly. It works totally fine on my PC. But once I upload and run it from Scrapinghub, one (state) of the many substitutions (others work fine) does not work; it returns %7B%7D& in the URL which is…
Billy Jhon
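The string `%7B%7D` is the percent-encoded form of `{}`, which suggests that one placeholder reached the request without being filled. The snippet below only reproduces that symptom with the standard library; the URL and the parameter names `state` and `city` are made up for illustration, and it does not diagnose why the value was missing on the hosted run.

```python
from urllib.parse import quote

# A template with two placeholders, filled as intended.
template = "https://example.com/search?state={}&city={}"
filled = template.format("NY", "Albany")

# The same URL with the first placeholder left untouched.
unfilled = "https://example.com/search?state={}&city=Albany"

# Percent-encoding turns the literal braces into %7B and %7D,
# while the characters listed in `safe` are left alone.
encoded = quote(unfilled, safe=":/?&=")
```

Logging the URL just before the request is created is a quick way to check whether the value was substituted at all.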
0 votes, 1 answer

How to use pip to install middleware on Scrapinghub

I have a Scrapy project that uses middleware installed via pip, specifically scrapy-random-useragent. Settings file: # -*- coding: utf-8 -*- # Scrapy settings for batdongsan project # # For simplicity, this file contains only settings considered…
Haha TTpro
0 votes, 0 answers

Cannot import ScrapinghubClient

>>> from scrapinghub import ScrapinghubClient Traceback (most recent call last): File "<stdin>", line 1, in <module> ImportError: cannot import name ScrapinghubClient Why is this happening? I have Python 2.7.13 |Continuum Analytics, Inc.| (default, May 11…
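A "cannot import name" error can occur when Python imports a different module than the one expected, for example an older installed version or a local file that shares the package's name. The sketch below shows one way to see which file a module name resolves to. It is written for Python 3 (the question mentions 2.7), the helper name `module_origin` is hypothetical, and it is a diagnostic aid rather than a confirmed cause of this error.

```python
import importlib.util


def module_origin(name):
    """Return the file a top-level module would be loaded from, or None."""
    spec = importlib.util.find_spec(name)
    return None if spec is None else spec.origin
```

Printing `module_origin("scrapinghub")` would reveal whether the name points into site-packages or at a file in the current project.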
0 votes, 1 answer

Scrapinghub job failed - can't diagnose

The spider stopped in the middle of the crawl (after a 7h run, 20K requests). The job status is "failure", even though there are no ERROR messages in the log. The log looks like the code just stopped running on a particular code line range without any…
noname7619
0 votes, 1 answer

Scrapy: Redirecting to a confirmation page with a captcha

How can I stop the redirect from a target URL to another URL, which is a website's confirmation page with a captcha? Here is my code: yield scrapy.Request(meta={'handle_httpstatus_list': [302], 'dont_redirect': True,…
RF_956
0 votes, 1 answer

ScrapingHub: ImportError: No module named firebase

I'm trying to put my scraped data in my Firebase account in the cloud, but I'm getting this ImportError when I run the spider. I tried making a new project and even reinstalling firebase and shub on a specific version of Python, but it did not help. The spider…
P.hunter
0 votes, 1 answer

How to extract files from ScrapingHub?

I have deployed some Scrapy spiders to scrape data, which I can download as .csv from ScrapingHub. Some of these spiders have a FilePipeline, which I used to download files (PDF) to a specific folder. Is there any way I can retrieve these files from…
graph
0 votes, 0 answers

scrapy script stops after certain requests

I have a Scrapy script running on Scrapinghub. The scraper takes one argument, a CSV file in which the URLs are stored. The script runs without error, but the problem is that it isn't scraping all the items from the URL. I have no idea why this…
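When fewer items come back than expected, counting the input first can narrow down where they are lost. The sketch below reads URLs from a CSV file and reports how many are unique; by default Scrapy filters duplicate requests, so repeated URLs are one possible cause worth ruling out. The layout (one URL per row, first column) and the names `read_urls` and `summarize` are assumptions, since the excerpt does not show the file.

```python
import csv


def read_urls(csv_path, column=0):
    """Return the non-empty values found in one column of a CSV file."""
    urls = []
    with open(csv_path, newline="", encoding="utf-8") as handle:
        for row in csv.reader(handle):
            # Skip blank rows and rows too short to have the column.
            if len(row) > column and row[column].strip():
                urls.append(row[column].strip())
    return urls


def summarize(urls):
    """Return (total, unique) so duplicates in the input are easy to spot."""
    return len(urls), len(set(urls))
```

Comparing the unique count with the number of requests reported in the job statistics shows whether the gap appears before or after the requests are sent.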