Questions tagged [scrapinghub]

Scrapinghub, a web scraping development and services company, supplies cloud-based web crawling platforms.

179 questions
0
votes
1 answer
Getting spider on Scrapy Cloud to store files on Google Cloud Storage using GCSFilesStore and getting ImportError
I am deploying a spider to Scrapy Cloud. It gathers download links for files and should save those files in a Google Cloud bucket. It works when run locally, but when deployed to Scrapinghub it returns the following error:
Traceback (most recent…

markkazanski
- 439
- 7
- 20
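
For the GCSFilesStore question above, here is a minimal sketch of the settings such a pipeline needs; the bucket and project names are placeholders. On Scrapy Cloud this kind of ImportError usually means the google-cloud-storage package was not listed in the project's requirements.txt:

    # settings.py -- FilesPipeline backed by Google Cloud Storage (Scrapy >= 1.5)
    ITEM_PIPELINES = {
        "scrapy.pipelines.files.FilesPipeline": 1,
    }
    FILES_STORE = "gs://my-example-bucket/files"  # hypothetical bucket
    GCS_PROJECT_ID = "my-example-project"         # hypothetical GCP project

    # requirements.txt deployed with the project must list:
    # google-cloud-storage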
0
votes
1 answer
How to scrape data from a website that uses JavaScript with pagination
I have a website I need to scrape data from,
"https://www.forever21.com/us/shop/catalog/category/f21/sale#pageno=1&pageSize=120&filter=price:0,250&sort=5", but I cannot retrieve all the data; it also has pagination and uses JavaScript.…

Christian Read
- 135
- 11
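
One common approach to the JavaScript-pagination question above is to render each page through Splash via scrapy-splash. The sketch below assumes a Splash instance is configured and that the #pageno fragment drives client-side pagination; the spider name, page limit, and CSS selector are all illustrative:

    import scrapy
    from scrapy_splash import SplashRequest  # needs SPLASH_URL set in settings

    class SaleSpider(scrapy.Spider):
        name = "sale"  # hypothetical spider name

        def start_requests(self):
            # The #pageno fragment is interpreted client-side by JavaScript,
            # so each page is rendered in Splash's browser before parsing.
            base = ("https://www.forever21.com/us/shop/catalog/category/f21/sale"
                    "#pageno={page}&pageSize=120&filter=price:0,250&sort=5")
            for page in range(1, 4):  # arbitrary page limit for the sketch
                yield SplashRequest(base.format(page=page), self.parse,
                                    args={"wait": 2})

        def parse(self, response):
            for title in response.css(".product_title::text").getall():  # hypothetical selector
                yield {"title": title}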
0
votes
2 answers
ScrapingHub: No module named mysql.connector
On my local machine everything works fine, but when I deployed it on ScrapingHub I got the error "ImportError: No module named mysql.connector".
All I need is that whenever I run my spider, or run it through the job scheduler, it will…

Christian Read
- 135
- 11
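
The usual fix for the missing mysql.connector above is to declare the dependency so Scrapy Cloud installs it at deploy time. A sketch of the shub configuration, with a placeholder project ID:

    # scrapinghub.yml -- project ID below is a placeholder
    projects:
      default: 12345
    requirements:
      file: requirements.txt

    # requirements.txt
    # mysql-connector-python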
0
votes
1 answer
Scrapinghub exporting multiple items
In Scrapinghub, how can we export multiple items?
I have MainItem() and SubItem() item classes and I would like to get two separate items on Scrapinghub's items page.
I can do this by implementing different item pipelines for both
…

Jithin
- 1,692
- 17
- 25
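
For the multiple-items question above, one standard pattern is a single pipeline that dispatches on the item class; the module path and handling below are illustrative. Scrapy Cloud records each item's class, so the two types can also be viewed separately on the items page:

    # pipelines.py -- route MainItem and SubItem differently in one pipeline
    from myproject.items import MainItem, SubItem  # hypothetical module path

    class MultiItemPipeline(object):
        def process_item(self, item, spider):
            if isinstance(item, MainItem):
                spider.logger.debug("main item: %r", item)  # export/store main items here
            elif isinstance(item, SubItem):
                spider.logger.debug("sub item: %r", item)   # export/store sub items here
            return item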
0
votes
2 answers
Distributed communication between Scrapy spiders
I want to run two spiders in a coordinated fashion. The first spider will scrape some website and produce URLs, and the second one will consume these addresses. I can't wait for the first spider to finish and then launch the second one, since the website…

Bociek
- 1,195
- 2
- 13
- 28
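
A common way to coordinate the two spiders above is a shared queue. The sketch below uses scrapy-redis: the producer pushes URLs onto a Redis list, and a RedisSpider consumes them as they arrive, so neither spider waits for the other to finish. The key name, start URL, and selectors are illustrative:

    import redis
    import scrapy
    from scrapy_redis.spiders import RedisSpider

    QUEUE_KEY = "consumer:start_urls"  # hypothetical Redis list name

    class ProducerSpider(scrapy.Spider):
        name = "producer"
        start_urls = ["https://example.com"]  # placeholder

        def parse(self, response):
            server = redis.Redis()  # assumes Redis on localhost
            for href in response.css("a::attr(href)").getall():
                server.lpush(QUEUE_KEY, response.urljoin(href))

    class ConsumerSpider(RedisSpider):
        name = "consumer"
        redis_key = QUEUE_KEY  # blocks waiting for new URLs from the producer

        def parse(self, response):
            yield {"url": response.url}  # placeholder processing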
0
votes
0 answers
How to "add" geckodriver to PATH on ScrapingHub?
I am using Python 2 for web scraping. I have written a spider that uses headless Firefox (no GUI) to go to a website, log in with my account, and further interact with the website by pressing buttons, filling forms, calendars, etc. It works as…

Luis Viguria
- 1
- 1
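
Rather than editing PATH on Scrapy Cloud, one workaround for the geckodriver question above is to point Selenium at the binary directly; the path below is a placeholder for wherever the driver actually lives (e.g. shipped in a custom Docker image):

    from selenium import webdriver
    from selenium.webdriver.firefox.options import Options

    options = Options()
    options.add_argument("-headless")  # keep Firefox headless, as in the question

    driver = webdriver.Firefox(
        executable_path="/app/geckodriver",  # hypothetical location inside the image
        options=options,
    )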
0
votes
0 answers
Retrieve all items from Scrapinghub as hash
I retrieved all items from a job in Scrapinghub:
url = "https://storage.scrapinghub.com/items/#{job_id}?apikey=#{API_KEY}"
response = HTTParty.get(url)
items = response.parsed_response
The problem is that items is a String instead of a Hash. Is…

abc03
- 13
- 3
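
The string result above is likely because the items endpoint answers in JSON-lines by default, which HTTParty leaves unparsed; requesting JSON explicitly returns a structure that parses cleanly. A sketch of the same call in Python (the language used elsewhere on this page), with placeholder job ID and key:

    import requests

    job_id = "123456/1/2"     # hypothetical project/spider/job
    api_key = "YOUR_API_KEY"  # placeholder

    resp = requests.get(
        "https://storage.scrapinghub.com/items/%s" % job_id,
        params={"apikey": api_key, "format": "json"},  # ask for a JSON array explicitly
    )
    items = resp.json()  # a list of dicts rather than a raw string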
0
votes
1 answer
Adding meta deltafetch_key for every request in SitemapSpider and CrawlSpider
I'm using scrapinghub's deltafetch feature in order to get new pages from a website without requesting the URLs I have already scraped.
I've noticed that on some websites, Scrapy would still scrape pages with an already visited URL. I had to replace…

romain-lavoix
- 403
- 2
- 6
- 20
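
For the deltafetch_key question above, a sketch of attaching an explicit key to every request a CrawlSpider rule emits, keyed on the URL so revisits are skipped even when the default request fingerprint would differ; the spider name and start URL are illustrative:

    from scrapy.linkextractors import LinkExtractor
    from scrapy.spiders import CrawlSpider, Rule

    class ArticleSpider(CrawlSpider):
        name = "articles"  # hypothetical spider
        start_urls = ["https://example.com"]

        rules = (
            Rule(LinkExtractor(), callback="parse_item",
                 process_request="add_deltafetch_key", follow=True),
        )

        def add_deltafetch_key(self, request, response=None):
            # scrapy-deltafetch falls back to the request fingerprint when
            # this meta key is absent; keying on the URL makes dedup explicit.
            request.meta["deltafetch_key"] = request.url
            return request

        def parse_item(self, response):
            yield {"url": response.url}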
0
votes
1 answer
How can I transform a value after it was extracted?
I am using Portia to extract info from a page. However, one of the values extracted is not in a format that I can use.
More specifically, I want to extract a numeric value which uses a dot instead of a comma to denote thousands, e.g. "1.000" instead…

George Eracleous
- 4,278
- 6
- 41
- 50
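
A sketch of one way to handle the "1.000" question above once the raw value is extracted, using an item loader input processor; the loader and field names are illustrative:

    from scrapy.loader import ItemLoader
    from scrapy.loader.processors import MapCompose, TakeFirst

    def parse_thousands(value):
        # "1.000" -> 1000: the dot is a thousands separator, not a decimal point
        return int(value.replace(".", ""))

    class ProductLoader(ItemLoader):
        default_output_processor = TakeFirst()
        price_in = MapCompose(lambda v: v.strip(), parse_thousands)  # hypothetical field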
0
votes
2 answers
From local scrapy to scrapy cloud (scraping hub) - Unexpected results
The scraper I deployed on Scrapy Cloud is producing unexpected results compared to the local version.
My local version can easily extract every field of a product item (from an online retailer), but on Scrapy Cloud the field "ingredients"…

BoobaGump
- 525
- 1
- 6
- 17
0
votes
0 answers
Scrapy 0 pages crawled but no visible issue?
I used Portia to create a spider and then downloaded it as a Scrapy project. The spider runs fine, but the logs say Scrapy crawled 0 pages (at 0 pages/min) and nothing gets saved. However, it also shows all the pages crawled with 200…

SimpleCoder
- 67
- 7
0
votes
1 answer
What does Scrapy Job Setting mean?
I was reading https://doc.scrapinghub.com/scrapy-cloud.html#scrapycloud and am confused about what it means to override a Scrapy setting for a job. Does it mean that I can change the start_url? Or which settings can I actually override?…

Andre Rumapea
- 3
- 5
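
Job settings on Scrapy Cloud are per-run overrides of Scrapy settings (DOWNLOAD_DELAY, USER_AGENT, and so on), not spider arguments such as start_url. A sketch using the python-scrapinghub client, with placeholder key, project ID, and spider name:

    from scrapinghub import ScrapinghubClient

    client = ScrapinghubClient("YOUR_API_KEY")  # placeholder key
    project = client.get_project(123456)        # hypothetical project ID
    project.jobs.run(
        "myspider",
        job_settings={"DOWNLOAD_DELAY": 2, "USER_AGENT": "my-bot/1.0"},
    )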
0
votes
0 answers
"Scrapy crawl " not working from the project folder running from spiders folder
I am newbie in Python. I have tried finding the solution everywhere but couldn't get through.
I have made a Scrapy project and because of the project structure, the spiders are by default stored in /spiders directory.
Problem: We generally run the…

Sankalp Nigam
- 37
- 1
- 1
- 7
0
votes
1 answer
How can deltafetch & splash be used together in Scrapy (python)
I am trying to build a scraper using Scrapy, and I plan to use deltafetch to enable incremental refresh, but I need to parse JavaScript-based pages, which is why I need to use Splash as well.
In the settings.py file, we need to add
SPIDER_MIDDLEWARES =…

Aayush Agrawal
- 184
- 1
- 6
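
For the deltafetch-plus-Splash question above, the two packages' documented settings can simply be merged; the main point of care is giving each spider middleware its own order number. A sketch of a combined settings.py, with the Splash URL as a placeholder:

    # settings.py -- scrapy-splash and scrapy-deltafetch side by side
    SPLASH_URL = "http://localhost:8050"  # wherever Splash is running

    DOWNLOADER_MIDDLEWARES = {
        "scrapy_splash.SplashCookiesMiddleware": 723,
        "scrapy_splash.SplashMiddleware": 725,
        "scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware": 810,
    }

    SPIDER_MIDDLEWARES = {
        "scrapy_splash.SplashDeduplicateArgsMiddleware": 100,
        "scrapy_deltafetch.DeltaFetch": 101,  # distinct order from the Splash entry
    }

    DUPEFILTER_CLASS = "scrapy_splash.SplashAwareDupeFilter"
    DELTAFETCH_ENABLED = True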
0
votes
2 answers
Automatically Parse a Website
I have an idea and want to see whether it is possible to implement. I want to parse a website (copart.com) that shows a different, large list of cars daily, with a corresponding description for each car. Daily, I am tasked with going over each…

Geek96
- 57
- 8