Questions tagged [scrapinghub]

Scrapinghub, a web scraping development and services company, supplies cloud-based web crawling platforms.

179 questions
0
votes
1 answer

Getting spider on Scrapy Cloud to store files on Google Cloud Storage using GCSFilesStore and getting ImportError

Deploying a spider to Scrapy Cloud. It gathers download links for files and should save those files in a Google Cloud bucket. It works when running locally, but when deployed to Scrapinghub it returns the following error: Traceback (most recent…
markkazanski • 439 • 7 • 20
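This ImportError usually means the google-cloud-storage client library, which Scrapy's GCSFilesStore imports, is missing from the Scrapy Cloud image; it has to be listed in the project's requirements.txt (see the scrapinghub.yml sketch under the mysql.connector question below). A minimal settings sketch, with placeholder bucket and project IDs:

```python
# settings.py -- a minimal sketch; bucket and project IDs are placeholders
ITEM_PIPELINES = {
    "scrapy.pipelines.files.FilesPipeline": 1,
}
FILES_STORE = "gs://my-bucket/files/"  # the gs:// scheme selects GCSFilesStore
GCS_PROJECT_ID = "my-gcp-project"
```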
0
votes
1 answer

How to scrape data on a website that uses JavaScript with pagination

I have a website I need to scrape, "https://www.forever21.com/us/shop/catalog/category/f21/sale#pageno=1&pageSize=120&filter=price:0,250&sort=5", but I cannot retrieve all the data; the site has pagination and uses JavaScript as well.…
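Note that the #pageno=… part is a URL fragment, which is never sent to the server, so plain Scrapy requests all fetch the same page; a JavaScript-rendering service such as Splash (or the site's underlying XHR endpoint) is needed. A sketch assuming scrapy-splash is installed and a Splash instance is running:

```python
import scrapy
from scrapy_splash import SplashRequest

class SaleSpider(scrapy.Spider):
    name = "f21_sale"

    def start_requests(self):
        base = ("https://www.forever21.com/us/shop/catalog/category/f21/sale"
                "#pageno={}&pageSize=120&filter=price:0,250&sort=5")
        for page in range(1, 6):  # arbitrary page limit for the sketch
            # "wait" gives the page's JavaScript time to render the grid
            yield SplashRequest(base.format(page), self.parse, args={"wait": 2.0})

    def parse(self, response):
        # the CSS selectors are hypothetical; inspect the rendered HTML
        for product in response.css("div.product_tile"):
            yield {"name": product.css("p.item_brand::text").get()}
```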
0
votes
2 answers

ScrapingHub: No module named mysql.connector

On my local machine everything works fine, but when I deployed it on ScrapingHub I got an error saying "ImportError: No module named mysql.connector". All I need is that, whenever I run my spider or run it through the job schedule, it will…
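As with the GCS question above, the module is simply missing from the Scrapy Cloud image; dependencies have to be declared in a requirements.txt referenced from scrapinghub.yml. A sketch (project ID is a placeholder):

```yaml
# scrapinghub.yml
project: 12345
requirements:
  file: requirements.txt
# requirements.txt then lists, e.g.:
#   mysql-connector-python
```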
0
votes
1 answer

Scrapinghub exporting multiple items

In Scrapinghub, how can we export multiple item types? I have MainItem() and SubItem() item classes and I would like to get two separate items on Scrapinghub's items page. I can do this by implementing different item pipelines for both …
Jithin • 1,692 • 17 • 25
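One common pattern is a single pipeline that dispatches on the item class rather than one pipeline per type. A sketch, where MainItem and SubItem stand in for the asker's classes:

```python
import scrapy

class MainItem(scrapy.Item):
    title = scrapy.Field()

class SubItem(scrapy.Item):
    detail = scrapy.Field()

class RoutingPipeline(object):
    """Dispatch on item class instead of registering two pipelines."""
    def process_item(self, item, spider):
        if isinstance(item, MainItem):
            pass  # main-item handling/export goes here
        elif isinstance(item, SubItem):
            pass  # sub-item handling/export goes here
        return item
```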
0
votes
2 answers

Distributed communication between Scrapy spiders

I want to run two spiders in a coordinated fashion. The first spider will scrape some website and produce URLs, and the second one will consume these addresses. I can't wait for the first spider to finish and then launch the second one, since the website…
Bociek • 1,195 • 2 • 13 • 28
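A common approach is a shared queue between the two processes. scrapy-redis (assumed installed, with a Redis server reachable via the REDIS_URL setting) gives the consumer side almost for free, while the producer pushes URLs with plain redis-py:

```python
# Consumer spider: idles on a Redis list and starts crawling URLs as soon
# as the producer pushes them, so neither spider waits for the other.
from scrapy_redis.spiders import RedisSpider

class ConsumerSpider(RedisSpider):
    name = "consumer"
    redis_key = "consumer:start_urls"  # the list the producer pushes to

    def parse(self, response):
        yield {"url": response.url}

# Producer side, e.g. inside the first spider's parse callback:
#     import redis
#     redis.Redis().lpush("consumer:start_urls", discovered_url)
```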
0
votes
0 answers

How to "add" geckodriver to PATH on ScrapingHub?

I am using Python 2 for web scraping. I have written a spider that uses headless Firefox (no GUI) to go to a website, log in with my account, and interact with the site by pressing buttons, filling forms, calendars, etc. It works as…
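Selenium does not actually need geckodriver on PATH if the binary's location is passed explicitly. A sketch for the Selenium 3.x API of that era; the path is a placeholder for wherever the binary is shipped inside the deployed project:

```python
from selenium import webdriver
from selenium.webdriver.firefox.options import Options

options = Options()
options.headless = True  # no GUI, as in the question

driver = webdriver.Firefox(
    executable_path="/app/geckodriver",  # placeholder location
    options=options,
)
```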
0
votes
0 answers

Retrieve all items from Scrapinghub as hash

I retrieved all items from a job in Scrapinghub: url = "https://storage.scrapinghub.com/items/#{job_id}?apikey=#{API_KEY}" response = HTTParty.get(url) items = response.parsed_response The problem is that items is a String instead of a Hash. Is…
abc03 • 13 • 3
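The items endpoint serves JSON Lines by default, which generic HTTP clients hand back as one big string. Requesting format=json (an assumption about the Scrapinghub items API) returns a single JSON array instead; in Ruby, appending &format=json to the URL should likewise make parsed_response an Array of Hashes. A sketch of the same call in Python:

```python
import requests

JOB_ID = "123/4/5"        # placeholder project/spider/job id
API_KEY = "your-api-key"  # placeholder

resp = requests.get(
    "https://storage.scrapinghub.com/items/" + JOB_ID,
    params={"apikey": API_KEY, "format": "json"},
)
items = resp.json()  # a list of dicts rather than a raw string
```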
0
votes
1 answer

Adding meta deltafetch_key for every request in SitemapSpider and CrawlSpider

I'm using scrapinghub's deltafetch feature in order to get new pages from a website without requesting the URLs I have already scraped. I've noticed that on some websites, scrapy would still scrape pages with an already visited URL. I had to replace…
romain-lavoix • 403 • 2 • 6 • 20
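deltafetch keys requests by their fingerprint unless meta["deltafetch_key"] overrides it, so for a CrawlSpider the rules' process_request hook can attach a URL-based key to every generated request. A sketch:

```python
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class MySpider(CrawlSpider):
    name = "deltafetch_example"
    start_urls = ["https://example.com"]  # placeholder

    rules = (
        Rule(LinkExtractor(), callback="parse_item",
             process_request="add_deltafetch_key", follow=True),
    )

    # response=None keeps this compatible with Scrapy >= 1.7, where the
    # hook is called with (request, response)
    def add_deltafetch_key(self, request, response=None):
        request.meta["deltafetch_key"] = request.url
        return request

    def parse_item(self, response):
        yield {"url": response.url}
```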
0
votes
1 answer

How can I transform a value after it was extracted?

I am using Portia to extract info from a page. However, one of the values extracted is not in a format that I can use. More specifically, I want to extract a numeric value which uses a dot instead of a comma to denote thousands e.g. "1.000" instead…
George Eracleous • 4,278 • 6 • 41 • 50
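Portia projects are regular Scrapy projects underneath, so a small item pipeline can normalize the value after extraction. A sketch, where "price" is a placeholder for the asker's field name:

```python
class NormalizeNumberPipeline(object):
    def process_item(self, item, spider):
        raw = item.get("price")
        if raw:
            # "1.000" (dot as thousands separator) -> 1000
            item["price"] = int(raw.replace(".", ""))
        return item
```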
0
votes
2 answers

From local Scrapy to Scrapy Cloud (Scrapinghub) - Unexpected results

The scraper I deployed on Scrapy Cloud is producing an unexpected result compared to the local version. My local version can easily extract every field of a product item (from an online retailer), but on Scrapy Cloud the field "ingredients"…
BoobaGump • 525 • 1 • 6 • 17
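Discrepancies like this often come down to the cloud stack running a different Scrapy/Python version than the local machine, or to geo-dependent page content. Pinning the stack in scrapinghub.yml rules out the first cause; a sketch where the stack name is illustrative and should match the local setup:

```yaml
# scrapinghub.yml -- project ID and stack name are placeholders
project: 12345
stacks:
  default: scrapy:1.6-py3
```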
0
votes
0 answers

Scrapy 0 pages crawled but no visible issue?

I used Portia to create a spider and then downloaded it as a Scrapy project. The spider runs fine, but the log says: Crawled 0 pages (at 0 pages/min), and nothing gets saved. However, it also shows all the pages crawled with 200…
0
votes
1 answer

What does Scrapy Job Setting mean?

I was reading https://doc.scrapinghub.com/scrapy-cloud.html#scrapycloud and am confused about what it means to override a Scrapy setting for a job. Does it mean that I can change the start_url? Or which settings can I really override?…
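Per-job settings are Scrapy settings (e.g. DOWNLOAD_DELAY), not spider attributes, so start_urls is not one of them; spider arguments cover that case instead. A sketch with the python-scrapinghub client, where the API key, project ID and spider name are placeholders:

```python
from scrapinghub import ScrapinghubClient

client = ScrapinghubClient("your-api-key")
project = client.get_project(12345)

project.jobs.run(
    "myspider",
    job_settings={"DOWNLOAD_DELAY": 2},  # per-job Scrapy setting override
    job_args={"start_url": "https://example.com"},  # spider argument, read via -a
)
```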
0
votes
0 answers

"Scrapy crawl " not working from the project folder running from spiders folder

I am a newbie in Python. I have tried finding the solution everywhere but couldn't get through. I have made a Scrapy project, and because of the project structure the spiders are stored by default in the /spiders directory. Problem: we generally run the…
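The scrapy crawl command locates the project through scrapy.cfg. An alternative that does not depend on the shell's working directory is to launch the spider programmatically; a sketch where the module and spider names are placeholders:

```python
# run.py -- Scrapy still needs to find the settings, via scrapy.cfg in a
# parent directory or the environment variable set below.
import os

os.environ.setdefault("SCRAPY_SETTINGS_MODULE", "myproject.settings")

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

process = CrawlerProcess(get_project_settings())
process.crawl("myspider")
process.start()  # blocks until the crawl finishes
```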
0
votes
1 answer

How can deltafetch & splash be used together in Scrapy (python)

I am trying to build a scraper using Scrapy, and I plan to use deltafetch to enable incremental refresh, but I need to parse JavaScript-based pages, which is why I need to use Splash as well. In the settings.py file, we need to add SPIDER_MIDDLEWARES =…
Aayush Agrawal • 184 • 1 • 6
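scrapy-deltafetch is a spider middleware while scrapy-splash hooks into the downloader middleware chain, so the two can be enabled side by side. A settings.py sketch using each project's documented values; the Splash endpoint is a placeholder:

```python
SPIDER_MIDDLEWARES = {
    "scrapy_deltafetch.DeltaFetch": 100,
}
DELTAFETCH_ENABLED = True

SPLASH_URL = "http://localhost:8050"  # placeholder Splash endpoint
DOWNLOADER_MIDDLEWARES = {
    "scrapy_splash.SplashCookiesMiddleware": 723,
    "scrapy_splash.SplashMiddleware": 725,
    "scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware": 810,
}
DUPEFILTER_CLASS = "scrapy_splash.SplashAwareDupeFilter"
```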
0
votes
2 answers

Automatically Parse a Website

I have an idea and want to see whether it is possible to implement. I want to parse a website (copart.com) that shows, daily, a different and large list of cars with the corresponding description for each car. Daily, I am tasked with going over each…