I am trying to run my Scrapy script main.py in a Docker container. The script runs 3 spiders sequentially and writes their scraped items to a local DB. Here is the source code of main.py:
from twisted.internet import reactor, defer
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging
from scrapy.utils.project import get_project_settings
from spiders.puntorigenera_spider import PuntorigeneraSpider
from spiders.lamiastampante_spider import LamiastampanteSpider
from spiders.printer_spider import PrinterSpider

configure_logging()
crawler_settings = get_project_settings()
runner = CrawlerRunner(settings=crawler_settings)

@defer.inlineCallbacks
def crawl():
    # Run the three spiders one after another, then stop the reactor.
    yield runner.crawl(PrinterSpider)
    yield runner.crawl(LamiastampanteSpider)
    yield runner.crawl(PuntorigeneraSpider)
    reactor.stop()

if __name__ == "__main__":
    crawl()
    reactor.run()
These are the DB settings specified in settings.py:
DB_SETTINGS = {
    'db': "COMPATIBILITA_PRODOTTI_SCHEMA_2",
    'user': 'root',
    'passwd': '',
    'host': 'localhost',
    'port': 3306
}
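For completeness: the items are written to the DB by an item pipeline that reads these settings. My real pipeline is longer, but it follows this general shape (a simplified sketch; the class name DBPipeline and the pymysql driver are illustrative here, not my exact code):

import pymysql

class DBPipeline:
    """Simplified sketch of the pipeline that writes items to MySQL."""

    def open_spider(self, spider):
        # Read the connection parameters defined in settings.py.
        db = spider.settings.get('DB_SETTINGS')
        self.conn = pymysql.connect(
            host=db['host'],
            port=db['port'],
            user=db['user'],
            password=db['passwd'],
            database=db['db'],
        )
        self.cursor = self.conn.cursor()

    def process_item(self, item, spider):
        # The INSERT statements for the scraped fields go here.
        return item

    def close_spider(self, spider):
        self.conn.commit()
        self.conn.close()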
This is my Dockerfile:
# As Scrapy runs on Python, I use the official Python 3 Docker image.
FROM python:3.7.3-stretch

# Set the working directory to /scraper/src/docker.
WORKDIR /scraper/src/docker

# Copy requirements.txt from the host into the working directory of the container.
COPY requirements.txt ./

# Install the dependencies listed in requirements.txt.
RUN pip3 install --no-cache-dir -r requirements.txt

# Copy the project source code into the working directory of the container.
COPY . .

# Run the crawler when the container launches.
CMD [ "python3", "./scraper/scraper/main.py" ]
The structure of my project is as follows:
proj
|− scraper
|  |− scraper
|     |− spiders
|     |  |− ...
|     |  |− ...
|     |− main.py
|     |− ...
|− Dockerfile
|− requirements.txt
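Since the Dockerfile sets WORKDIR /scraper/src/docker and then does COPY . ., my understanding is that the project ends up inside the container laid out like this, which is why the CMD points at ./scraper/scraper/main.py:

/scraper/src/docker
|− scraper
|  |− scraper
|     |− spiders
|     |− main.py
|− Dockerfile
|− requirements.txt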
PROBLEM
When I run python main.py locally, it works fine: I can see the scraper running in the terminal and the DB gets populated successfully. However, when I build the Docker image with docker build -t mycrawler . and run it with docker run --network=host mycrawler, all I can see is this output:
2020-11-08 13:13:48 [scrapy.crawler] INFO: Overridden settings: {}
2020-11-08 13:13:48 [scrapy.extensions.telnet] INFO: Telnet Password: 01b06b3e6f172d1d
2020-11-08 13:13:48 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
'scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.memusage.MemoryUsage',
'scrapy.extensions.logstats.LogStats']
... and it stays like this forever, never writing anything to the DB.
I am really new to Docker. Am I missing something in the Dockerfile or in the way I run the built image?