I have a project with Scrapy, Scrapyd and Django. My crawler uses the Django models to add the scraped items to the database through the pipelines.
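For context, the pipeline looks roughly like this (simplified sketch; the settings module path, app name and item fields here are placeholders for my real ones):

import os
import django

# Django has to be set up before the models can be imported,
# because scrapyd runs the spiders outside of manage.py.
# "app.settings" and "pokedex.models" are placeholders for my real paths.
os.environ.setdefault("DJANGO_SETTINGS_MODULE", "app.settings")
django.setup()

from pokedex.models import Pokemon


class PokemonPipeline:
    def process_item(self, item, spider):
        # save the scraped item through the Django ORM
        Pokemon.objects.update_or_create(
            name=item["name"], defaults=dict(item)
        )
        return item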
What I did was use a single container to start both scrapyd and the Django server, but this gives me the problem that scrapyd can't find the spiders even though they exist.
docker-compose.yaml
version: "3"

services:
  api:
    build:
      context: .
    ports:
      - "8000:8000"
    volumes:
      - ./app:/app
    command: >
      sh -c "cd pokemon_crawler && scrapyd &
             python manage.py makemigrations &&
             python manage.py migrate &&
             python manage.py runserver 0.0.0.0:8000"
    environment:
      - DB_HOST=db
      - DB_NAME=pokedex
      - DB_USER=postgres
      - DB_PASS=supersecretpassword
    depends_on:
      - db

  db:
    image: "postgres:10-alpine"
    environment:
      - POSTGRES_USER=postgres
      - POSTGRES_PASSWORD=supersecretpassword
      - POSTGRES_DB=pokedex
My view to run the crawler:
from rest_framework.views import APIView
from rest_framework import authentication, permissions
from rest_framework.response import Response
from scrapyd_api import ScrapydAPI


class CrawlerView(APIView):
    scrapyd = ScrapydAPI("http://localhost:6800")
    authentication_classes = [authentication.TokenAuthentication]
    permission_classes = [permissions.IsAdminUser]

    def post(self, request, format=None):
        pokemons = request.POST.getlist("pokemons", None)
        if not pokemons or not isinstance(pokemons, list):
            return Response({"error": "Missing args"})
        pokemons = ["+"] if "all" in pokemons else pokemons
        settings = {
            "USER_AGENT": "Mozilla/5.0 (compatible; Googlebot/2.1; "
            "+http://www.google.com/bot.html)",
        }
        # Schedule a new crawling task on scrapyd.
        # This returns an ID that identifies the task.
        task = self.scrapyd.schedule(
            "pokemon_crawler", "pokemon", settings=settings, pokemons=pokemons
        )
        return Response({"task_id": task, "status": "started"})

    def get(self, request, format=None):
        task_id = request.GET.get("task_id", None)
        if not task_id:
            return Response({"error": "Missing args"})
        status = self.scrapyd.job_status("pokemon_crawler", task_id)
        return Response({"status": status})
Spider:
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class PokemonSpider(CrawlSpider):
    name = "pokemon"
    base_url = "https://pokemondb.net"
    allowed_domains = ["pokemondb.net", "pokemon.gameinfo.io"]
    start_urls = ["https://pokemondb.net/pokedex/all"]

    def __init__(self, *args, **kwargs):
        PokemonSpider.rules = (
            Rule(
                LinkExtractor(
                    allow=[
                        f"/pokedex/{pokemon}"
                        for pokemon in kwargs.get("pokemons")
                    ],
                    deny=("/pokedex/all",),
                ),
                callback="parse",
            ),
        )
        # let CrawlSpider compile the rules
        super().__init__(*args, **kwargs)

    def parse(self, response):
        pass  # I have an implementation, but it's irrelevant here
As you can see, I start scrapyd first, put it in the background, and then run the Django server. I actually haven't tried running it with docker-compose up yet; what I did was run some Django tests I wrote, with this command: cd pokemon_crawler && scrapyd & python manage.py test [the test to run].
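The test is basically an authenticated POST against that view, roughly like this (a simplified sketch; the route and credentials are placeholders, not my real ones):

from django.contrib.auth import get_user_model
from rest_framework.test import APITestCase


class CrawlerViewTests(APITestCase):
    def test_schedule_crawl(self):
        # force_authenticate bypasses the token auth just for this sketch
        admin = get_user_model().objects.create_superuser(
            "admin", "admin@example.com", "password"
        )
        self.client.force_authenticate(user=admin)
        # "/api/crawler/" stands in for my actual route
        response = self.client.post("/api/crawler/", {"pokemons": ["pikachu"]})
        self.assertEqual(response.data["status"], "started")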
But when I run this, I get the error scrapyd_api.exceptions.ScrapydResponseError: spider 'pokemon' not found.
How can I fix this? Or is there a better way to do this Docker setup? I know I could create another container with its own network just for scrapyd, but I need access to the Pokemon model in the Scrapy pipeline so I can save the scraped items to the database. Can I do that with a separate container?
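For example, something like this is what I have in mind (just a sketch, not tested; the scrapyd service would share the same code volume and DB settings as the api service, and the view would then use ScrapydAPI("http://scrapyd:6800") instead of localhost):

  scrapyd:
    build:
      context: .
    command: sh -c "cd pokemon_crawler && scrapyd"
    ports:
      - "6800:6800"
    volumes:
      - ./app:/app
    environment:
      - DB_HOST=db
      - DB_NAME=pokedex
      - DB_USER=postgres
      - DB_PASS=supersecretpassword
    depends_on:
      - db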
I followed this guide to set everything up.