Questions tagged [scrapy-pipeline]

218 questions
0
votes
1 answer

python scrapy pipeline suddenly doesn't work

It is very weird: I wrote the Scrapy code with its pipeline and crawled a huge amount of data, and it always worked well. Today, when I re-ran the same code, it suddenly didn't work at all. Here are the details: My Spider - base_url_spider.py import…
Cherry Wu
  • 3,844
  • 9
  • 43
  • 63
0
votes
1 answer

Scrapy Pipeline unknown number of results

I have a scrapy spider which gets the start_urls from a MySQL database. When it scrapes each page it comes back with an unknown number of links, meaning it could have zero links or up to 10 links from each page that it scrapes. Because that number…
SDailey
  • 17
  • 3
0
votes
1 answer

Scrapy Pipeline SQL Syntax error

I have a spider that grabs URLs from a MySQL DB and uses those URLs as the start_urls to scrape, which in turn grabs any number of new links from the scraped pages. When I set the pipeline to INSERT both the start_url and new scraped url to a new…
SDailey
  • 17
  • 3
0
votes
1 answer

How can I check if Scrapy Image Pipeline is using a proxy to download images?

I have built a scraper and would like to download some images using a proxy in Scrapy. I don't know if it is really downloading through the proxy. Response headers don't show the IP. Furthermore, if I change the IP to a random IP, it still downloads…
zer02
  • 3,963
  • 4
  • 31
  • 66
0
votes
1 answer

Scrapy Regex Custom Pipeline

This is my Scrapy custom regex pipeline code: for p in item['code']: for search_type, pattern in RegEx.regexp.iteritems(): s = re.findall(pattern, p) if s: return item else: …
Stuart
  • 11
0
votes
0 answers

How to download images from dynamically generated hashed url using scrapy?

I am using scrapy to download images from website https://pixabay.com/. My working code is as below- from scrapy.spiders import Spider from scrapy.selector import Selector from scrapy.http import Request from website.imageItems import…
Bit_hunter
  • 789
  • 2
  • 8
  • 25
0
votes
2 answers

Retrieve HTTP return code from ImagesPipeline (or MediaPipeline) in scrapy

I have a working spider scraping image URLs and placing them in image_urls field of a scrapy.Item. I have a custom pipeline that inherits from ImagesPipeline. When a specific URL returns a non-200 http response code (like say a 401 error). For…
hAcKnRoCk
  • 1,118
  • 3
  • 16
  • 30
0
votes
2 answers

Crawl website from list of values using scrapy

I have a list of NPIs for which I want to scrape the provider names from npidb.org. The NPI values are stored in a CSV file. I am able to do it manually by pasting the URLs in the code. However, I am unable to figure out how to do it if I…
0
votes
0 answers

Scrapy - Pipe data to database if keyword match found

Put simply, I'm scraping web data in Scrapy. I need to analyse the scraped data for keywords / regex and if matched, pipeline the data to database. If not found, drop. My question is: should/can I do this from within Scrapy and if so do you have…
Stuart
  • 11
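
One way the keyword check described in this question could live inside Scrapy is as an item pipeline placed before the database pipeline. The following is only a minimal sketch under assumptions: the class name `KeywordFilterPipeline`, the `text` field, and the patterns are hypothetical, and items are assumed to behave like dicts. `process_item` and `DropItem` are the standard Scrapy pipeline hook and exception; the local fallback class exists only so the sketch runs where Scrapy is not installed.

```python
import re

try:
    # Real Scrapy exception; raising it from process_item discards the item.
    from scrapy.exceptions import DropItem
except ImportError:
    # Stand-in so the sketch also runs where Scrapy is not installed.
    class DropItem(Exception):
        pass


class KeywordFilterPipeline:
    """Pass items whose text matches any pattern; drop everything else."""

    # Hypothetical patterns and field name, chosen only for illustration.
    patterns = [re.compile(r"\bscrapy\b", re.I), re.compile(r"pipeline", re.I)]
    field = "text"

    def process_item(self, item, spider):
        text = item.get(self.field) or ""
        if any(p.search(text) for p in self.patterns):
            # A returned item continues to later pipelines, e.g. a database writer.
            return item
        raise DropItem(f"no keyword match in {self.field!r}")
```

If this approach were used, the filter would be listed in the `ITEM_PIPELINES` setting with a lower order number than the database pipeline, so that only matching items reach the database.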
0
votes
1 answer

Scrapy Get returned Value from pipeline

I'm trying to get the returned value from a pipeline. I'm using a yield generator to generate items. And this is my code. def get_or_create(model): model_class = type(model) created = False try: obj =…
Murat Kaya
  • 1,281
  • 3
  • 28
  • 52
0
votes
1 answer

Below POST Method is not working in scrapy

I have tried with headers, cookies, formdata and body too, but I got 401 and 500 status codes. On this site the first page uses the GET method and gives an HTML response, and further pages use the POST method and give a JSON response. But these status codes arrive…
Vimal Annamalai
  • 139
  • 1
  • 2
  • 12
0
votes
1 answer

Scrapy Only Cache Images

I thought I had found a solution using the RFC2616 policy, but when testing, the scraper execution time still seems to stay the same. So I went back to the default policy. I'm directing my image_urls to 'production.pipelines.MyImagesPipeline'. Now I only…
Kevin G
  • 2,325
  • 3
  • 16
  • 30
0
votes
2 answers

Can't get value from Scrapy stats dictionary

I have this pipeline in my Scrapy project where I need to get some info from the Scrapy stats: class MyPipeline(object): def __init__(self, stats): self.stats = stats @classmethod def from_crawler(cls, crawler): return…
Aminah Nuraini
  • 18,120
  • 8
  • 90
  • 108
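
The pattern this question starts from, passing `crawler.stats` into a pipeline through `from_crawler`, might be completed roughly as follows. This is a sketch under assumptions, not the asker's code: `StatsAwarePipeline`, `FakeStats`, and `FakeCrawler` are hypothetical names, and the two fakes only stand in for a real crawler so the example runs without a crawl. `get_value(key, default)` mirrors the stats collector call in Scrapy, and `item_scraped_count` is assumed to have been populated by the time the spider closes.

```python
class StatsAwarePipeline:
    """Keeps a reference to the crawler's stats collector."""

    def __init__(self, stats):
        self.stats = stats

    @classmethod
    def from_crawler(cls, crawler):
        # Scrapy calls this hook, if present, to build the pipeline.
        return cls(crawler.stats)

    def close_spider(self, spider):
        # get_value takes a key and an optional default for missing keys.
        self.scraped = self.stats.get_value("item_scraped_count", 0)


# Stand-ins so the sketch runs without a real crawl (not part of Scrapy).
class FakeStats:
    def __init__(self, values):
        self._values = values

    def get_value(self, key, default=None):
        return self._values.get(key, default)


class FakeCrawler:
    def __init__(self, stats):
        self.stats = stats


crawler = FakeCrawler(FakeStats({"item_scraped_count": 3}))
pipeline = StatsAwarePipeline.from_crawler(crawler)
pipeline.close_spider(spider=None)
print(pipeline.scraped)  # 3
```

Reading values through `get_value` with a default avoids a `KeyError` or `None` surprise when a stat has not been recorded yet.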
0
votes
0 answers

Debugging Scrapy Item pipeline

I am trying to persist the scraped items into MySQL running on localhost. Even though the spider crawls the sites and scrapes items in the intended way, my pipeline object for persisting does not work - it does not store items into the…
zmg
  • 1
  • 4
0
votes
2 answers

Use scrapy as an item generator

I have an existing script (main.py) that requires data to be scraped. I started a scrapy project for retrieving this data. Now, is there any way main.py can retrieve the data from scrapy as an Item generator, rather than persisting data using the…
bsuire
  • 1,383
  • 2
  • 18
  • 27