Questions tagged [scrapinghub]
Scrapinghub, a web scraping development and services company, supplies cloud-based web crawling platforms.
179 questions
0
votes
0 answers
How to upload a file in a custom format to an AWS S3 bucket using Scrapinghub?
I'm currently building a Python backend that will upload the list of items to an S3 bucket in the format I need, using the Scrapinghub service and the Scrapy module.
I successfully log into the website, loop through the pages, and start yielding items in…

IK KLX
- 121
- 2
- 14
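For the S3 question above, a minimal sketch of Scrapy's built-in feed exports (the FEEDS setting, Scrapy 2.1+), which can write directly to S3 when botocore is installed; the bucket name, key template, and credentials are placeholders, and a custom output format would be registered via FEED_EXPORTERS:

# settings.py
AWS_ACCESS_KEY_ID = "<access-key>"            # placeholder credentials
AWS_SECRET_ACCESS_KEY = "<secret-key>"

FEEDS = {
    "s3://my-bucket/items-%(time)s.json": {   # %(time)s expands per run
        "format": "json",                     # swap in a custom exporter name here
    },
}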
0
votes
0 answers
How can I debug this? AttributeError: 'StdoutLogger' object has no attribute 'buffer'
My Scrapy code gets this error on Scrapinghub, but not locally. AFAICT I'm using Python 3.8 and Scrapy 2.1.0 in both places.
Anybody know what's going on here? I can't find the source to StdoutLogger to try figuring it out.
File…

Dogweather
- 15,512
- 17
- 62
- 81
0
votes
0 answers
Scrapinghub spider finishes and closes before task is done
I am using Scrapinghub cloud with a Splash instance to scrape content and images from a large list of URLs that are provided to the spider. There are around 50,000 URLs that I wish to crawl.
The first time I ran it, the spider went for just under…

BradleyB19
- 147
- 5
- 10
0
votes
1 answer
Selenium Long Page Load in Chrome
I have built a scraper in Python 3.6 using Selenium and Scrapinghub Crawlera. I am trying to fetch this car and download its photos: https://www.cars.com/vehicledetail/detail/800885995/overview/ but the page just keeps loading for long periods of…

dcarlo56ave
- 253
- 5
- 18
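A sketch for the page-load question above, assuming a recent Selenium (on Selenium 3 the pageLoadStrategy desired capability does the same): cap the load time and stop outstanding requests instead of waiting indefinitely. The 30-second timeout is an arbitrary choice.

from selenium import webdriver
from selenium.common.exceptions import TimeoutException

options = webdriver.ChromeOptions()
options.page_load_strategy = "eager"      # return once the DOM is ready, not on full load
driver = webdriver.Chrome(options=options)
driver.set_page_load_timeout(30)          # raise TimeoutException after 30 s

try:
    driver.get("https://www.cars.com/vehicledetail/detail/800885995/overview/")
except TimeoutException:
    driver.execute_script("window.stop();")   # cut off whatever is still loading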
0
votes
2 answers
scrapinghub: Difference between DeltaFetch and HTTPCACHE_ENABLED
I struggle to understand the difference between DeltaFetch and HttpCacheMiddleware. Don't both have the goal that I only scrape pages I haven't requested before?

Joey Coder
- 3,199
- 8
- 28
- 60
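On the question above: the two work at different layers. DeltaFetch is a spider middleware that skips requests for pages that already produced items in an earlier run, while HttpCacheMiddleware still schedules every request but answers it from a local response cache instead of the network. A settings sketch, assuming the scrapy-deltafetch plugin is installed:

# settings.py
SPIDER_MIDDLEWARES = {
    "scrapy_deltafetch.DeltaFetch": 100,   # skip already-scraped requests across runs
}
DELTAFETCH_ENABLED = True

HTTPCACHE_ENABLED = True                   # replay cached responses
HTTPCACHE_EXPIRATION_SECS = 0              # 0 = cached responses never expire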
0
votes
0 answers
How to best deal with Scrapy GeneratorExits and retry failed requests?
Background:
I have a Scrapy spider running on Scrapy Cloud using Crawlera for proxies. The website I am trying to crawl is deep in the sense that each page has many "next" pages (i.e., pagination). Sometimes it can be up to 50 pages deep in terms of…

Keida
- 23
- 1
- 4
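For the retry question above, a sketch of Scrapy's built-in retry settings, which re-schedule failed requests without custom code; the list of status codes is illustrative, not exhaustive:

# settings.py
RETRY_ENABLED = True
RETRY_TIMES = 5                                    # attempts beyond the first request
RETRY_HTTP_CODES = [429, 500, 502, 503, 504, 522, 524]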
0
votes
0 answers
Selenium Desired Capabilities Crashing When Adding Proxy
I have deployed a Selenium scraper to Scrapinghub cloud using a custom Docker container. When running the script I get the following error from Selenium, and I am not sure what causes this issue when I deploy but not in my local…

dcarlo56ave
- 253
- 5
- 18
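One hedged workaround for the proxy question above: instead of a Proxy object in DesiredCapabilities, which some ChromeDriver/container combinations reject, pass the proxy as a Chrome command-line switch. The proxy address is a placeholder.

from selenium import webdriver

PROXY = "proxy.example.com:8010"              # placeholder host:port

options = webdriver.ChromeOptions()
options.add_argument(f"--proxy-server=http://{PROXY}")  # proxy via Chrome flag
options.add_argument("--headless")            # typical inside a Docker container
options.add_argument("--no-sandbox")          # often required when running as root
driver = webdriver.Chrome(options=options)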
0
votes
0 answers
Web scraping/crawling a PDF document whose URL changes on the website, with Python
import os
import requests
from bs4 import BeautifulSoup
desktop = os.path.expanduser("~/Desktop")
url = 'https://www.ici.org/research/stats'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')
excel_files =…

Dimitra
- 3
- 2
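A sketch completing the approach in the question above: select links by file extension rather than by exact URL, so the script keeps working when the document's URL changes. The extensions and the base URL used to resolve relative links are assumptions.

import os
import requests
from bs4 import BeautifulSoup

desktop = os.path.expanduser("~/Desktop")
url = "https://www.ici.org/research/stats"
soup = BeautifulSoup(requests.get(url).text, "html.parser")

# Match by extension so a renamed or re-dated URL is still picked up.
links = [a["href"] for a in soup.find_all("a", href=True)
         if a["href"].lower().endswith((".pdf", ".xls", ".xlsx"))]

for link in links:
    if link.startswith("/"):                       # resolve relative links
        link = "https://www.ici.org" + link
    filename = os.path.join(desktop, link.rsplit("/", 1)[-1])
    with open(filename, "wb") as f:
        f.write(requests.get(link).content)        # download and save each file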
0
votes
1 answer
Want to understand Robots.txt
I would like to scrape a website. However, I want to make sense of its robots.txt before I do.
The lines that I don't understand are:
User-agent: *
Disallow: /*/*/*/*/*/*/*/*/
Disallow: /*?&*&*
Disallow: /*?*&*
Disallow: /*|*
Does the User Agent…

TheGr8Destructo
- 96
- 10
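On the question above: "User-agent: *" means the rules that follow apply to every crawler, and "*" inside a path is a wildcard matching any run of characters (a widely supported extension that Python's stdlib urllib.robotparser does not understand). A sketch that translates the quoted rules into regexes so sample paths can be tested:

import re

rules = [
    "/*/*/*/*/*/*/*/*/",
    "/*?&*&*",
    "/*?*&*",
    "/*|*",
]

def disallowed(path):
    for rule in rules:
        # escape regex metacharacters, then turn the robots '*' back into '.*'
        pattern = re.escape(rule).replace(r"\*", ".*")
        if re.match(pattern, path):      # robots rules anchor at the path start
            return True
    return False

print(disallowed("/a/b/c/d/e/f/g/h/"))   # True: eight path segments deep
print(disallowed("/search?a&b&c"))       # True: '?' then '&' matches the third rule
print(disallowed("/about"))              # False: no rule matches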
0
votes
1 answer
Crawlera, cookies, sessions, rate limiting
I'm trying to use Scrapinghub to crawl a website that heavily limits request rate.
If I run the spider as-is, I get HTTP 429 pretty soon.
If I enable Crawlera as per the standard instructions, the spider doesn't work anymore.
If I set headers =…

kenshin
- 197
- 11
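For the rate-limit question above, a settings sketch based on scrapy-crawlera's documented configuration; the API key is a placeholder. Once Crawlera manages pacing and sessions, Scrapy's own delay features are normally switched off:

# settings.py
DOWNLOADER_MIDDLEWARES = {
    "scrapy_crawlera.CrawleraMiddleware": 610,
}
CRAWLERA_ENABLED = True
CRAWLERA_APIKEY = "<your-api-key>"            # placeholder

AUTOTHROTTLE_ENABLED = False                  # let Crawlera handle throttling
DOWNLOAD_DELAY = 0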
0
votes
1 answer
Python: I am trying to web scrape a page but I am not able to find the HTML
I am trying to scrape this page (https://www.polarislist.com/).
I am trying to pull all of the data, such as class size, free/reduced lunch, student/teacher ratio, % of student demographics by race, and the respective counts of MIT, Harvard, Princeton…

randomDev
- 31
- 4
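A quick diagnostic sketch for the question above: if the figures visible in the browser are absent from the raw HTML, the page is rendered client-side by JavaScript and a plain HTML parser will never see them; a browser-driven tool (Selenium, Splash) or the site's underlying data endpoint would be needed instead.

import requests

html = requests.get("https://www.polarislist.com/").text
# If this prints False, the data is injected by JavaScript after page load.
print("MIT" in html)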
0
votes
0 answers
Is there a way to block certain URLs while a spider is running?
I am writing a spider using the Scrapy framework (I am using the CrawlSpider to crawl every link in a domain) to pull certain files from a given domain. I want to block certain URLs where the spider is not finding files. For example, if the spider…

Logan Anderson
- 564
- 2
- 13
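For the URL-blocking question above, a sketch of a downloader middleware that drops requests matching a denylist the spider can grow while it runs; `blocked` is a hypothetical set attribute on the spider, and the middleware would be enabled via DOWNLOADER_MIDDLEWARES in settings.py:

from scrapy.exceptions import IgnoreRequest

class DynamicBlockMiddleware:
    """Drop any request whose URL contains a runtime-added pattern."""

    def process_request(self, request, spider):
        for pattern in getattr(spider, "blocked", set()):
            if pattern in request.url:
                raise IgnoreRequest(f"blocked at runtime: {request.url}")

A callback that notices a file-less branch could then call self.blocked.add("/no-files-here/") to stop the spider from revisiting it.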
0
votes
0 answers
Spider stops after one hour of processing one item
I have a Scrapy spider running in a (non-free) Scrapinghub account that sometimes has to OCR a PDF (via Tesseract), which depending on the number of units can take quite some time.
What I see in the log is something like this:
2220: 2019-07-07…

kenshin
- 197
- 11
0
votes
1 answer
I cannot figure out how to use a CSV file for a list comprehension in a Scrapinghub deployment
I'm trying to deploy a spider to Scrapinghub and cannot figure out how to tackle a data-input problem. I need to read IDs from a CSV and append them to my start URLs as a list comprehension for the spider to crawl:
class…

mth10
- 3
- 2
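For the deployment question above, one common pattern: ship the CSV inside the project package and read it with pkgutil, since a deployed spider has no stable filesystem path. The package name, file path, and URL template below are placeholders, and the CSV must be declared as package data in setup.py to be included in the deployed egg.

import csv
import io
import pkgutil

import scrapy

class IdsSpider(scrapy.Spider):
    name = "ids_spider"                       # hypothetical spider

    def start_requests(self):
        # e.g. setup.py: package_data={"myproject": ["resources/*.csv"]}
        raw = pkgutil.get_data("myproject", "resources/ids.csv").decode("utf-8")
        ids = [row[0] for row in csv.reader(io.StringIO(raw)) if row]
        for url in [f"https://example.com/item/{i}" for i in ids]:
            yield scrapy.Request(url, callback=self.parse)

    def parse(self, response):
        ...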
0
votes
1 answer
Scrapy: 'ascii' codec can't encode characters
I am having a problem running my crawler:
UnicodeEncodeError: 'ascii' codec can't encode characters in position
I am using this code:
author = str(info.css(".author::text").extract_first())
but I am still getting that error. Any idea how I can solve…

Christian Read
- 135
- 11
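A hedged sketch for the encoding question above, assuming Python 3 and the selector variable `info` from the question: extract_first() already returns str (or None), so wrapping it in str() only turns a missing value into the literal "None". The error itself usually comes from writing the output with an ASCII encoding, which the feed-export setting below addresses.

author = info.css(".author::text").extract_first(default="")  # already str; no str() needed

# settings.py -- write exports as UTF-8 instead of the platform default:
FEED_EXPORT_ENCODING = "utf-8"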