Questions tagged [scrapinghub]
Scrapinghub, a web scraping development and services company, supplies cloud-based web crawling platforms.
179 questions
0
votes
0 answers
How to upload a file in a custom format to an AWS S3 bucket using Scrapinghub?
I'm currently building a Python backend that will upload the list of items to an S3 bucket in the format I need, using the Scrapinghub service and the Scrapy module.
I successfully log into the website, loop through the pages, and start yielding items in…

IK KLX
- 121
- 2
- 14
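For the S3 question above, a minimal sketch of Scrapy's built-in feed exports (the FEEDS setting, Scrapy 2.1+), which can write directly to S3 when botocore is installed; the bucket name, key template, and credentials are placeholders, and a custom output format would be registered via FEED_EXPORTERS:

# settings.py
AWS_ACCESS_KEY_ID = "<access-key>"            # placeholder credentials
AWS_SECRET_ACCESS_KEY = "<secret-key>"

FEEDS = {
    "s3://my-bucket/items-%(time)s.json": {   # %(time)s expands per run
        "format": "json",                     # swap in a custom exporter name here
    },
}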
0
votes
0 answers
How can I debug this? AttributeError: 'StdoutLogger' object has no attribute 'buffer'
My Scrapy code gets this error on Scrapinghub, but not locally. AFAICT I'm using Python 3.8 and Scrapy 2.1.0 in both places.
Anybody know what's going on here? I can't find the source to StdoutLogger to try figuring it out.
File…

Dogweather
- 15,512
- 17
- 62
- 81
0
votes
0 answers
Scrapinghub spider finishes and closes before task is done
I am using Scrapinghub cloud with a Splash instance to scrape content and images from a large list of URLs that are provided to the spider. There are around 50,000 URLs that I wish to crawl.
The first time I ran it, the spider went for just under…

BradleyB19
- 147
- 5
- 10
0
votes
1 answer
Selenium Long Page Load in Chrome
I have built a scraper in Python 3.6 using Selenium and Scrapinghub Crawlera. I am trying to fetch this car and download its photos: https://www.cars.com/vehicledetail/detail/800885995/overview/ but the page just keeps loading for long periods of…

dcarlo56ave
- 253
- 5
- 18
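A sketch for the page-load question above, assuming a recent Selenium (on Selenium 3 the pageLoadStrategy desired capability does the same): cap the load time and stop outstanding requests instead of waiting indefinitely. The 30-second timeout is an arbitrary choice.

from selenium import webdriver
from selenium.common.exceptions import TimeoutException

options = webdriver.ChromeOptions()
options.page_load_strategy = "eager"      # return once the DOM is ready, not on full load
driver = webdriver.Chrome(options=options)
driver.set_page_load_timeout(30)          # raise TimeoutException after 30 s

try:
    driver.get("https://www.cars.com/vehicledetail/detail/800885995/overview/")
except TimeoutException:
    driver.execute_script("window.stop();")   # cut off whatever is still loading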
0
votes
2 answers
scrapinghub: Difference between DeltaFetch and HTTPCACHE_ENABLED
I struggle to understand the difference between DeltaFetch and HttpCacheMiddleware. Don't both have the goal that I only scrape pages I haven't requested before?

Joey Coder
- 3,199
- 8
- 28
- 60
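On the question above: the two work at different layers. DeltaFetch is a spider middleware that skips requests for pages that already produced items in an earlier run, while HttpCacheMiddleware still schedules every request but answers it from a local response cache instead of the network. A settings sketch, assuming the scrapy-deltafetch plugin is installed:

# settings.py
SPIDER_MIDDLEWARES = {
    "scrapy_deltafetch.DeltaFetch": 100,   # skip already-scraped requests across runs
}
DELTAFETCH_ENABLED = True

HTTPCACHE_ENABLED = True                   # replay cached responses
HTTPCACHE_EXPIRATION_SECS = 0              # 0 = cached responses never expire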
0
votes
0 answers
How to best deal with Scrapy GeneratorExits and retry failed requests?
Background:
I have a Scrapy spider running on Scrapy Cloud using Crawlera for proxies. The website I am trying to crawl is deep in the sense that each page has many "next" pages (i.e., pagination). Sometimes it can be up to 50 pages deep in terms of…

Keida
- 23
- 1
- 4
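For the retry question above, a sketch of Scrapy's built-in retry settings, which re-schedule failed requests without custom code; the list of status codes is illustrative, not exhaustive:

# settings.py
RETRY_ENABLED = True
RETRY_TIMES = 5                                    # attempts beyond the first request
RETRY_HTTP_CODES = [429, 500, 502, 503, 504, 522, 524]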
0
votes
0 answers
Selenium Desired Capabilities Crashing When Adding Proxy
I have deployed a Selenium scraper to Scrapinghub cloud using a custom Docker container. When running the script I get the following error from Selenium, and I am not sure what causes this issue when I deploy but not in my local…

dcarlo56ave
- 253
- 5
- 18
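One hedged workaround for the proxy question above: instead of a Proxy object in DesiredCapabilities, which some ChromeDriver/container combinations reject, pass the proxy as a Chrome command-line switch. The proxy address is a placeholder.

from selenium import webdriver

PROXY = "proxy.example.com:8010"              # placeholder host:port

options = webdriver.ChromeOptions()
options.add_argument(f"--proxy-server=http://{PROXY}")  # proxy via Chrome flag
options.add_argument("--headless")            # typical inside a Docker container
options.add_argument("--no-sandbox")          # often required when running as root
driver = webdriver.Chrome(options=options)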
0
votes
0 answers
Web scraping/crawling a PDF document whose URL changes on the website, with Python
import os
import requests
from bs4 import BeautifulSoup
desktop = os.path.expanduser("~/Desktop")
url = 'https://www.ici.org/research/stats'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')
excel_files =…

Dimitra
- 3
- 2
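A sketch completing the approach in the question above: select links by file extension rather than by exact URL, so the script keeps working when the document's URL changes. The extensions and the base URL used to resolve relative links are assumptions.

import os
import requests
from bs4 import BeautifulSoup

desktop = os.path.expanduser("~/Desktop")
url = "https://www.ici.org/research/stats"
soup = BeautifulSoup(requests.get(url).text, "html.parser")

# Match by extension so a renamed or re-dated URL is still picked up.
links = [a["href"] for a in soup.find_all("a", href=True)
         if a["href"].lower().endswith((".pdf", ".xls", ".xlsx"))]

for link in links:
    if link.startswith("/"):                       # resolve relative links
        link = "https://www.ici.org" + link
    filename = os.path.join(desktop, link.rsplit("/", 1)[-1])
    with open(filename, "wb") as f:
        f.write(requests.get(link).content)        # download and save each file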
0
votes
1 answer
Want to understand Robots.txt
I would like to scrape a website. However, I want to make sense of its robots.txt before I do.
The lines that I don't understand are:
User-agent: *
Disallow: /*/*/*/*/*/*/*/*/
Disallow: /*?&*&*
Disallow: /*?*&*
Disallow: /*|*
Does the User Agent…

TheGr8Destructo
- 96
- 10
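On the question above: "User-agent: *" means the rules that follow apply to every crawler, and "*" inside a path is a wildcard matching any run of characters (a widely supported extension that Python's stdlib urllib.robotparser does not understand). A sketch that translates the quoted rules into regexes so sample paths can be tested:

import re

rules = [
    "/*/*/*/*/*/*/*/*/",
    "/*?&*&*",
    "/*?*&*",
    "/*|*",
]

def disallowed(path):
    for rule in rules:
        # escape regex metacharacters, then turn the robots '*' back into '.*'
        pattern = re.escape(rule).replace(r"\*", ".*")
        if re.match(pattern, path):      # robots rules anchor at the path start
            return True
    return False

print(disallowed("/a/b/c/d/e/f/g/h/"))   # True: eight path segments deep
print(disallowed("/search?a&b&c"))       # True: '?' then '&' matches the third rule
print(disallowed("/about"))              # False: no rule matches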
0
votes
1 answer
Crawlera, cookies, sessions, rate limiting
I'm trying to use Scrapinghub to crawl a website that heavily limits request rate.
If I run the spider as-is, I get HTTP 429 pretty soon.
If I enable Crawlera as per the standard instructions, the spider doesn't work anymore.
If I set headers =…

kenshin
- 197
- 11
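For the rate-limit question above, a settings sketch based on scrapy-crawlera's documented configuration; the API key is a placeholder. Once Crawlera manages pacing and sessions, Scrapy's own delay features are normally switched off:

# settings.py
DOWNLOADER_MIDDLEWARES = {
    "scrapy_crawlera.CrawleraMiddleware": 610,
}
CRAWLERA_ENABLED = True
CRAWLERA_APIKEY = "<your-api-key>"            # placeholder

AUTOTHROTTLE_ENABLED = False                  # let Crawlera handle throttling
DOWNLOAD_DELAY = 0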
0
votes
1 answer
Python: I am trying to web scrape a page but I am not able to find the HTML
I am trying to scrape this page (https://www.polarislist.com/).
I am trying to pull all of the data, such as class size, free/reduced lunch, student/teacher ratio, % of student demographics by race, and the respective counts of MIT, Harvard, Princeton…

randomDev
- 31
- 4
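A quick diagnostic sketch for the question above: if the figures visible in the browser are absent from the raw HTML, the page is rendered client-side by JavaScript and a plain HTML parser will never see them; a browser-driven tool (Selenium, Splash) or the site's underlying data endpoint would be needed instead.

import requests

html = requests.get("https://www.polarislist.com/").text
# If this prints False, the data is injected by JavaScript after page load.
print("MIT" in html)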
0
votes
0 answers
Is there a way to block certain URLs while a spider is running?
I am writing a spider using the Scrapy framework (I am using the CrawlSpider to crawl every link in a domain) to pull certain files from a given domain. I want to block certain URLs where the spider is not finding files. For example, if the spider…

Logan Anderson
- 564
- 2
- 13
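For the URL-blocking question above, a sketch of a downloader middleware that drops requests matching a denylist the spider can grow while it runs; `blocked` is a hypothetical set attribute on the spider, and the middleware would be enabled via DOWNLOADER_MIDDLEWARES in settings.py:

from scrapy.exceptions import IgnoreRequest

class DynamicBlockMiddleware:
    """Drop any request whose URL contains a runtime-added pattern."""

    def process_request(self, request, spider):
        for pattern in getattr(spider, "blocked", set()):
            if pattern in request.url:
                raise IgnoreRequest(f"blocked at runtime: {request.url}")

A callback that notices a file-less branch could then call self.blocked.add("/no-files-here/") to stop the spider from revisiting it.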
0
votes
0 answers
Spider stops after one hour of processing one item
I have a Scrapy spider running in a (non-free) Scrapinghub account that sometimes has to OCR a PDF (via Tesseract), which depending on the number of units can take quite some time.
What I see in the log is something like this:
2220: 2019-07-07…

kenshin
- 197
- 11
0
votes
1 answer
I cannot figure out how to use a CSV file for a list comprehension in a Scrapinghub deployment
I'm trying to deploy a spider to Scrapinghub and cannot figure out how to tackle a data-input problem. I need to read IDs from a CSV and append them to my start URLs as a list comprehension for the spider to crawl:
class…

mth10
- 3
- 2
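For the deployment question above, one common pattern: ship the CSV inside the project package and read it with pkgutil, since a deployed spider has no stable filesystem path. The package name, file path, and URL template below are placeholders, and the CSV must be declared as package data in setup.py to be included in the deployed egg.

import csv
import io
import pkgutil

import scrapy

class IdsSpider(scrapy.Spider):
    name = "ids_spider"                       # hypothetical spider

    def start_requests(self):
        # e.g. setup.py: package_data={"myproject": ["resources/*.csv"]}
        raw = pkgutil.get_data("myproject", "resources/ids.csv").decode("utf-8")
        ids = [row[0] for row in csv.reader(io.StringIO(raw)) if row]
        for url in [f"https://example.com/item/{i}" for i in ids]:
            yield scrapy.Request(url, callback=self.parse)

    def parse(self, response):
        ...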
0
votes
1 answer
Scrapy: 'ascii' codec can't encode characters
I am having a problem running my crawler:
UnicodeEncodeError: 'ascii' codec can't encode characters in position
I am using this code:
author = str(info.css(".author::text").extract_first())
but I am still getting that error. Any idea how I can solve…

Christian Read
- 135
- 11
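A hedged sketch for the encoding question above, assuming Python 3 and the selector variable `info` from the question: extract_first() already returns str (or None), so wrapping it in str() only turns a missing value into the literal "None". The error itself usually comes from writing the output with an ASCII encoding, which the feed-export setting below addresses.

author = info.css(".author::text").extract_first(default="")  # already str; no str() needed

# settings.py -- write exports as UTF-8 instead of the platform default:
FEED_EXPORT_ENCODING = "utf-8"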