
I have a project that uses Scrapy, scrapyd and Django. My crawler uses the Django models to add the scraped items to the database through the Scrapy pipelines.
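
The pipeline does roughly this (the settings module, app, and model names below are placeholders for the real ones, simplified for this question):

import os

import django

# Django has to be set up before the models can be imported inside the Scrapy process
os.environ.setdefault("DJANGO_SETTINGS_MODULE", "app.settings")  # placeholder settings module
django.setup()

from pokedex.models import Pokemon  # placeholder app/model


class DjangoWriterPipeline:
    def process_item(self, item, spider):
        # Save the scraped item through the Django ORM
        Pokemon.objects.update_or_create(
            name=item["name"], defaults=dict(item)
        )
        return item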

What I did was use a single container to start both scrapyd and the Django server, but this causes a problem: scrapyd can't find the spiders even though they exist.

docker-compose.yaml

version: "3"

services:
  api:
    build:
      context: .
    ports:
      - "8000:8000"
    volumes:
      - ./app:/app
    command: >
      sh -c "cd pokemon_crawler && scrapyd &
             python manage.py makemigrations  &&
             python manage.py migrate &&
             python manage.py runserver 0.0.0.0:8000"

    environment:
      - DB_HOST=db
      - DB_NAME=pokedex
      - DB_USER=postgres
      - DB_PASS=supersecretpassword
    depends_on:
      - db
  db:
    image: "postgres:10-alpine"
    environment:
      - POSTGRES_USER=postgres
      - POSTGRES_PASSWORD=supersecretpassword
      - POSTGRES_DB=pokedex

My view to run the crawler:

from rest_framework.views import APIView
from rest_framework import authentication, permissions
from rest_framework.response import Response

from scrapyd_api import ScrapydAPI


class CrawlerView(APIView):

    scrapyd = ScrapydAPI("http://localhost:6800")

    authentication_classes = [authentication.TokenAuthentication]
    permission_classes = [permissions.IsAdminUser]

    def post(self, request, format=None):
        pokemons = request.POST.getlist("pokemons", None)

        if not pokemons or not isinstance(pokemons, list):
            return Response({"error": "Missing  args"})

        pokemons = ["+"] if "all" in pokemons else pokemons

        settings = {
            "USER_AGENT": "Mozilla/5.0 (compatible; Googlebot/2.1; "
            "+http://www.google.com/bot.html)",
        }

        # Here we schedule a new crawling task on scrapyd.
        # This returns an ID which belongs, and will belong, to this task.
        task = self.scrapyd.schedule(
            "pokemon_crawler", "pokemon", settings=settings, pokemons=pokemons
        )

        return Response({"task_id": task, "status": "started"})

    def get(self, request, format=None):
        task_id = request.GET.get("task_id", None)

        if not task_id:
            return Response({"error": "Missing args"})

        status = self.scrapyd.job_status("pokemon_crawler", task_id)
        return Response({"status": status})

Spider:

from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class PokemonSpider(CrawlSpider):
    name = "pokemon"
    base_url = "https://pokemondb.net"
    allowed_domains = ["pokemondb.net", "pokemon.gameinfo.io"]
    start_urls = ["https://pokemondb.net/pokedex/all"]

    def __init__(self, *args, **kwargs):
        PokemonSpider.rules = (
            Rule(
                LinkExtractor(
                    allow=[
                        f"/pokedex/{pokemon}"
                        for pokemon in kwargs.get("pokemons")
                    ],
                    deny=("/pokedex/all",),
                ),
                callback="parse",
            ),
        )
        super().__init__(*args, **kwargs)  # CrawlSpider compiles these rules in its __init__

    def parse(self, response):
        pass  # I have an implementation but it's irrelevant here

As you can see, I start scrapyd first, put it in the background, and then run the Django server. I actually didn't try running it with docker-compose up; what I did was run some Django tests I wrote, with this command: cd pokemon_crawler && scrapyd & python manage.py test [the test to run].
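
The test is essentially this (simplified; the URL and names are approximate):

from django.contrib.auth import get_user_model
from rest_framework.test import APIClient, APITestCase


class CrawlerViewTests(APITestCase):
    def test_schedule_crawl(self):
        admin = get_user_model().objects.create_superuser(
            "admin", "admin@example.com", "supersecret"
        )
        client = APIClient()
        client.force_authenticate(user=admin)

        # This POST is what ends up calling scrapyd.schedule() in the view
        response = client.post("/crawler/", {"pokemons": ["bulbasaur"]})  # URL is approximate

        self.assertEqual(response.data.get("status"), "started")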

But when I run that command, I receive the error scrapyd_api.exceptions.ScrapydResponseError: spider 'pokemon' not found.

How can I fix this? Or is there a better way to do this Docker setup? I know I could create another container with another network just for scrapyd, but I need access to the Pokemon model in the Scrapy pipeline to save the scraped items to the database. Can I do that with a separate container?
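
For example, would something along these lines work? A separate scrapyd service added under services in the compose file above, reusing the same image and code volume so the pipeline can still import the Django models, with the view pointing at http://scrapyd:6800 instead of localhost. This is just a sketch, not something I've tested:

  scrapyd:
    build:
      context: .  # same image as the api service, so the Scrapy project sees the Django code
    volumes:
      - ./app:/app
    command: sh -c "cd pokemon_crawler && scrapyd"
    ports:
      - "6800:6800"
    environment:
      - DB_HOST=db
      - DB_NAME=pokedex
      - DB_USER=postgres
      - DB_PASS=supersecretpassword
    depends_on:
      - db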

I followed this guide to set everything up.
