Questions tagged [scrapy-pipeline]

218 questions
0
votes
1 answer

My Scrapy item['img_urls'] doesn't download the file

I'm currently working on a student data science project, which consists of building a system that recognizes fish from pictures. We will use TensorFlow to make sense of the data and Scrapy to collect a massive amount of data (fish pictures and their scientific…
0
votes
0 answers

Scrapy best practice: Connect to database in crawler or in pipeline?

I am scraping a main page that has a list of items. Within my pipeline I connect to a database to store the items. My next task is to go to each individual item page and scrape comments. I need to connect to the database again to see if I've already…
Learning C
  • 679
  • 10
  • 27
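One common arrangement for the question above is to open a single connection when the spider starts and reuse it both for writes and for "have I seen this already?" checks. The sketch below is a minimal illustration under stated assumptions, not the asker's code: it uses `sqlite3` from the standard library in place of the unspecified database, and the class name `SQLitePipeline`, the `items` table, the `id`/`title` fields, and the `has_item` helper are all hypothetical. It relies only on the fact that a Scrapy item pipeline is a plain class exposing `open_spider`, `close_spider`, and `process_item`.

```python
import sqlite3


class SQLitePipeline:
    """Duck-typed item pipeline; db_path and the table layout are hypothetical."""

    def __init__(self, db_path=":memory:"):
        self.db_path = db_path
        self.conn = None

    def open_spider(self, spider):
        # Open one connection for the whole crawl instead of one per item.
        self.conn = sqlite3.connect(self.db_path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS items (id TEXT PRIMARY KEY, title TEXT)"
        )

    def close_spider(self, spider):
        # Flush pending writes and release the connection when the crawl ends.
        self.conn.commit()
        self.conn.close()

    def has_item(self, item_id):
        # Lookup helper that reuses the same connection as the writes.
        row = self.conn.execute(
            "SELECT 1 FROM items WHERE id = ?", (item_id,)
        ).fetchone()
        return row is not None

    def process_item(self, item, spider):
        # INSERT OR IGNORE leaves an existing row untouched on a repeated id.
        self.conn.execute(
            "INSERT OR IGNORE INTO items (id, title) VALUES (?, ?)",
            (item["id"], item["title"]),
        )
        return item
```

Keeping the connection in one place means the duplicate check and the insert cannot drift apart; whether the spider should call such a helper directly is a design choice this sketch does not settle.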
0
votes
1 answer

Why does my pipeline return previously modified items?

I created a pipeline to save each item to Elasticsearch. In this pipeline I check whether the item already exists, to see if an administrator has overridden some field and to force a reindex (I get this field and save/override it on the new item) class…
0
votes
1 answer

Scrapy Image Pipeline: How to rename images?

I have a spider which fetches both the data and the images. I want to rename the images with the respective 'title' which I'm fetching. The following is my code: spider1.py from imageToFileSystemCheck.items import ImagetofilesystemcheckItem import…
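Whatever hook ends up choosing the stored path, a scraped title usually has to be turned into a safe file name first. The helper below is a hedged sketch of that one step only: `title_to_filename`, its `extension` and `max_length` parameters, and the `"untitled"` fallback are assumptions for illustration and do not come from the question. In recent Scrapy versions such a helper could plausibly be called from an overridden `file_path` method of an `ImagesPipeline` subclass, but that wiring is not shown here.

```python
import re
import unicodedata


def title_to_filename(title, extension="jpg", max_length=80):
    """Turn a scraped title into a filesystem-safe file name (hypothetical helper)."""
    # Decompose accented characters so their ASCII base letters survive.
    normalized = unicodedata.normalize("NFKD", title)
    ascii_only = normalized.encode("ascii", "ignore").decode("ascii")
    # Collapse every run of non-alphanumeric characters into a single hyphen.
    slug = re.sub(r"[^A-Za-z0-9]+", "-", ascii_only).strip("-").lower()
    # Truncate long titles and fall back to a placeholder if nothing is left.
    slug = slug[:max_length].rstrip("-") or "untitled"
    return f"{slug}.{extension}"
```

Two different titles can collapse to the same slug, so a real pipeline would likely need a uniqueness suffix as well.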
0
votes
0 answers

scrapy - download image without compressing the picture

I am trying to download some images without compression, e.g. http://p1.pstatp.com/origin/433c000159def0223671. This picture is about 2.0 MB, but when I download it using Scrapy it is only 120 KB. settings.py BOT_NAME = 'toutiao' SPIDER_MODULES =…
咸蛋超人
  • 45
  • 1
  • 6
0
votes
2 answers

Scraping multiple tables and storing each table header as rows in csv

I'm trying to scrape multiple tables, each of which has a table name stored under an h3 tag. There are columns of data I can scrape with no problem, and when I feed the next URL I can append this data to the CSV file. The problem I can't solve is to get the table…
tomoc4
  • 337
  • 2
  • 10
  • 29
0
votes
0 answers

Scrapy log HTTP errors to database or pipeline

I'm trying to get a full picture of my crawls in a database (MySQL), so I need any errback output to be logged to the database. Is it possible to pass errback to the pipelines? I currently have it set up like so: Response -> (Item) -> Pipeline When…
Akustik
  • 43
  • 8
0
votes
1 answer

How to scrape tens of thousands of URLs every night using Scrapy

I am using Scrapy to scrape some big brands to import the sale data for my site. Currently I am using DOWNLOAD_DELAY = 1.5, CONCURRENT_REQUESTS_PER_DOMAIN = 16, and CONCURRENT_REQUESTS_PER_IP = 16. I am using an Item Loader to specify CSS/XPath rules and…
mmrs151
  • 3,924
  • 2
  • 34
  • 38
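A quick back-of-the-envelope estimate helps judge whether the settings quoted above can finish overnight. The sketch below assumes that `DOWNLOAD_DELAY` paces consecutive requests to the same site, so a single domain is limited to roughly one request per delay interval regardless of the concurrency settings. The function name `estimated_crawl_hours`, the figure of 30,000 URLs, and the even split across domains are hypothetical, and the estimate ignores response latency and retries.

```python
def estimated_crawl_hours(url_count, download_delay, domains=1):
    """Rough lower bound on crawl time, assuming the delay dominates.

    Assumes URLs are split evenly across `domains` and that the domains
    are crawled in parallel, each paced by `download_delay` seconds.
    """
    urls_per_domain = url_count / domains
    seconds = urls_per_domain * download_delay
    return seconds / 3600


# Hypothetical workload: 30,000 URLs on one domain with a 1.5 s delay.
single_domain = estimated_crawl_hours(30_000, 1.5)

# The same workload spread evenly over five domains.
five_domains = estimated_crawl_hours(30_000, 1.5, domains=5)
```

Under these assumptions the single-domain crawl needs about 12.5 hours and the five-domain crawl about 2.5 hours, which suggests the delay, not the concurrency limit, is the setting to revisit first.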
0
votes
0 answers

skip downloading but not other tasks in scrapy pipeline

Is there a way I can skip downloading a webpage but still have the other parts of the pipeline after it execute? Currently, I read a file of JSON objects in start_requests; each JSON object has a website URL and other data fields. If a website URL is not…
comiventor
  • 3,922
  • 5
  • 50
  • 77
0
votes
0 answers

Scrapy data not being written to database

The spider and pipeline are running fine, but the database still shows an empty set. Here is the pipeline code. I am using Python 2.7 and a MySQL database. from twisted.enterprise import adbapi class MysqlWriter(object): def __init__(self): …
0
votes
2 answers

Scrapy and Python: Response object has no attribute 'xpath'

EDIT 2 - Because my folders got mixed up with names I chose, I accidentally posted the wrong code. Please see below for accurate code of each file for the correct folder containing all my files for this. Settings # -*- coding: utf-8 -*- # Scrapy…
mlclm
  • 725
  • 6
  • 16
  • 38
0
votes
2 answers

Scrapy and celery `update_state`

I have the following setup (Docker): Celery linked to a Flask setup, which runs the Scrapy spider; the Flask setup (obviously); the Flask setup gets a request for Scrapy -> fires up a worker to do some work. Now I wish to update the original Flask setup on the…
WiseStrawberry
  • 317
  • 1
  • 4
  • 14
0
votes
1 answer

Python + Scrapy renaming downloaded images

IMPORTANT NOTE: all the answers currently available on Stack Overflow are for previous versions of Scrapy and don't work with the latest version, Scrapy 1.4. I am totally new to Scrapy and Python, and I am trying to scrape some pages and download the…
mlclm
  • 725
  • 6
  • 16
  • 38
0
votes
1 answer

Scrapy merge output on a field

I have a Scrapy output like this: [{'gender': 'women', 'name': 'NEW IN: CLOTHING', 'products': [{'name': 'Free People Cocoon Multi Way Neck Top', 'price': {'currency': 'GBP', 'outlet': '40.0', …
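Since the excerpt is cut off, the exact merge rule is unknown; one plausible reading is that records sharing the same `gender` and `name` should be combined into one record whose `products` lists are concatenated. The sketch below illustrates that reading only. The function `merge_on_fields`, its default key fields, and the second product in the usage example are assumptions, not part of the question.

```python
def merge_on_fields(records, key_fields=("gender", "name"), list_field="products"):
    """Combine records with equal key fields, concatenating their list field."""
    merged = {}
    for record in records:
        key = tuple(record.get(field) for field in key_fields)
        if key not in merged:
            # Copy the first record for this key, starting with an empty list
            # so the caller's input is never mutated.
            merged[key] = {**record, list_field: []}
        merged[key][list_field].extend(record.get(list_field, []))
    # Dicts preserve insertion order, so output follows first appearance.
    return list(merged.values())


# Hypothetical input: two records for the same category.
sample = [
    {"gender": "women", "name": "NEW IN: CLOTHING",
     "products": [{"name": "Free People Cocoon Multi Way Neck Top"}]},
    {"gender": "women", "name": "NEW IN: CLOTHING",
     "products": [{"name": "Hypothetical Second Product"}]},
]
combined = merge_on_fields(sample)
```

Running this kind of merge in a post-processing step, or in a pipeline's `close_spider`, avoids having to hold partial results inside the spider itself.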
0
votes
1 answer

Data crawling using scrapy package in python

I'm trying to get some data with images from a website (IMDB) using the 'scrapy' package. If there is an image_URL in the div class, then I'm able to crawl the data with the movie poster. However, if not, my code doesn't work properly. It skipped some data associate…