
Now my spiders are sending data to my site in this way:

import json

import requests


def parse_product(response, **cb_kwargs):
    item = {}
    item['url'] = response.url
    data = {
        "source_id": 505,
        "token": f"{API_TOKEN}",
        "products": [item],
    }
    headers = {'Content-Type': 'application/json'}
    url = 'http://some.site.com/api/'
    requests.post(url=url, headers=headers, data=json.dumps(data))

Is it possible to implement this through a pipeline or middleware? It is inconvenient to repeat this code in every spider.

P.S. The data needs to be sent as JSON (json.dumps(data)); if I use an Item class instead (item = MyItemClass()), an error occurs...

m_sasha

2 Answers


This can be done fairly easily with an item pipeline. You can also use Scrapy's Item and Field classes, as long as you cast the item to a dict before calling json.dumps.

For example:

import json

import requests


class Pipeline:

    def process_item(self, item, spider):
        # Cast the scraped item (dict or scrapy.Item) to a plain dict
        # so it can be serialized with json.dumps.
        data = dict(item)
        headers = {'Content-Type': 'application/json'}
        url = 'http://some.site.com/api/'
        requests.post(url=url, headers=headers, data=json.dumps(data))
        return item

If you use this example, process_item will be called for each and every item your spider yields. Just remember to activate the pipeline in your settings.py file, as shown below.
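
A minimal sketch of what that activation might look like, assuming the pipeline class above lives in myproject/pipelines.py (the module path and the priority value 300 are placeholders, not something fixed by the answer):

# settings.py -- enable the pipeline for the whole project
ITEM_PIPELINES = {
    'myproject.pipelines.Pipeline': 300,
}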

Alexander
  • Indeed, it seems to work: `data = { "source_id": "name", "token": "key", "products": [dict(item)] }`. Also, is it possible to pass to `dict(item)` not one item at a time but, for example, 10 (a list of dictionaries)? And is it possible to pass the spider's name in the `source_id` key? – m_sasha Aug 22 '22 at 19:20
  • @m_sasha On the list-of-dictionaries question I'm sure it is possible, but I'm not sure how, since the pipeline processes each item individually as it is handled by the execution engine; I suggest opening another question for that one, because it's a good question. As for source_id, you can pass anything you want; you just need to add it when you construct your item in your spider's parse method. Or I suppose you could just call `dict.update` inside the pipeline as well, as in the sketch below. – Alexander Aug 22 '22 at 19:28
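
A minimal sketch of that `dict.update` idea, for illustration only: putting the spider's name under `source_id` inside the pipeline is an assumption based on the comments, and the endpoint URL is the placeholder from the question.

import json

import requests


class Pipeline:

    def process_item(self, item, spider):
        data = dict(item)
        # dict.update idea from the comment above: add the spider's name
        # under source_id directly in the pipeline (key name taken from
        # the question's payload, placement here is assumed).
        data.update({'source_id': spider.name})
        requests.post(
            'http://some.site.com/api/',
            headers={'Content-Type': 'application/json'},
            data=json.dumps(data),
        )
        return item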

I found another solution (on GitHub); maybe someone will find it useful...

pipeline.py

import json
import logging

import requests
from scrapy.utils.serialize import ScrapyJSONEncoder
from twisted.internet.defer import DeferredLock
from twisted.internet.threads import deferToThread

default_serialize = ScrapyJSONEncoder().encode


class HttpPostPipeline(object):
    settings = None
    items_buffer = []

    DEFAULT_HTTP_POST_PIPELINE_BUFFERED = False
    DEFAULT_HTTP_POST_PIPELINE_BUFFER_SIZE = 100

    def __init__(self, url, headers=None, serialize_func=default_serialize):
        """Initialize pipeline.

        Parameters
        ----------
        url : str
            API endpoint to POST items to.
        headers : dict, optional
            Extra HTTP headers to send with each request.
        serialize_func : callable
            Items serializer function.
        """
        self.url = url
        self.headers = headers if headers else {}
        self.serialize_func = serialize_func
        self._lock = DeferredLock()

    @classmethod
    def from_crawler(cls, crawler):
        params = {
            'url': crawler.settings.get('HTTP_POST_PIPELINE_URL'),
        }
        if crawler.settings.get('HTTP_POST_PIPELINE_HEADERS'):
            params['headers'] = crawler.settings['HTTP_POST_PIPELINE_HEADERS']

        ext = cls(**params)
        ext.settings = crawler.settings

        return ext

    def process_item(self, item, spider):
        if self.settings.get('HTTP_POST_PIPELINE_BUFFERED', self.DEFAULT_HTTP_POST_PIPELINE_BUFFERED):
            self._lock.run(self._process_items, item)
            return item
        else:
            return deferToThread(self._process_item, item, spider)

    def _process_item(self, item, spider):
        data = self.serialize_func(item)
        requests.post(self.url, json=json.loads(data), headers=self.headers)
        return item

    def _process_items(self, item):
        self.items_buffer.append(item)
        if len(self.items_buffer) >= int(self.settings.get('HTTP_POST_PIPELINE_BUFFER_SIZE',
                                                            self.DEFAULT_HTTP_POST_PIPELINE_BUFFER_SIZE)):
            deferToThread(self.send_items, self.items_buffer)
            self.items_buffer = []

    def send_items(self, items):
        logging.debug("Sending batch of {} items".format(len(items)))

        serialized_items = [self.serialize_func(item) for item in items]
        requests.post(self.url, json=[json.loads(data) for data in serialized_items], headers=self.headers)

    def close_spider(self, spider):
        if len(self.items_buffer) > 0:
            deferToThread(self.send_items, self.items_buffer)
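
For completeness, a sketch of the settings.py entries this pipeline reads; the setting names come from the code above, while the module path, endpoint URL, and priority value are placeholders:

# settings.py -- example configuration (values are placeholders)
ITEM_PIPELINES = {
    'myproject.pipelines.HttpPostPipeline': 300,
}
HTTP_POST_PIPELINE_URL = 'http://some.site.com/api/'
HTTP_POST_PIPELINE_HEADERS = {'Content-Type': 'application/json'}
HTTP_POST_PIPELINE_BUFFERED = True   # buffer items instead of posting one by one
HTTP_POST_PIPELINE_BUFFER_SIZE = 100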
m_sasha