
I am trying to run my Scrapy script main.py in a Docker container. The script runs 3 spiders sequentially and writes their scraped items to a local DB. Here is the source code of main.py:

from twisted.internet import reactor, defer
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging
from scrapy.utils.project import get_project_settings

from spiders.puntorigenera_spider import PuntorigeneraSpider
from spiders.lamiastampante_spider import LamiastampanteSpider
from spiders.printer_spider import PrinterSpider

configure_logging()
crawler_settings = get_project_settings()
runner = CrawlerRunner(settings=crawler_settings)

@defer.inlineCallbacks
def crawl():
    yield runner.crawl(PrinterSpider)
    yield runner.crawl(LamiastampanteSpider)
    yield runner.crawl(PuntorigeneraSpider)
    reactor.stop()

if __name__ == "__main__":
    crawl()
    reactor.run()
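
One detail worth flagging in this pattern: if any of the yielded crawls fails, the exception propagates out of crawl() before reactor.stop() is reached, and the reactor runs forever. A defensive variant (my sketch, not the original code) guarantees the stop:

@defer.inlineCallbacks
def crawl():
    try:
        yield runner.crawl(PrinterSpider)
        yield runner.crawl(LamiastampanteSpider)
        yield runner.crawl(PuntorigeneraSpider)
    finally:
        # Stop the reactor even if one crawl errors out, so the process
        # exits with a traceback instead of hanging silently.
        reactor.stop()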

These are the DB settings specified in settings.py:

DB_SETTINGS = {
    'db': "COMPATIBILITA_PRODOTTI_SCHEMA_2",
    'user': 'root',
    'passwd': '',
    'host': 'localhost',
    'port': 3306
}
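
As the first comment below points out, 'host': 'localhost' resolves to the container itself once this runs under Docker. A sketch of an environment-driven version (the MYSQL_* variable names are my own choice, not something the project already defines):

import os

DB_SETTINGS = {
    'db': os.getenv('MYSQL_DB', "COMPATIBILITA_PRODOTTI_SCHEMA_2"),
    'user': os.getenv('MYSQL_USER', 'root'),
    'passwd': os.getenv('MYSQL_PASSWD', ''),
    # 'localhost' inside a container is the container, not the host machine.
    'host': os.getenv('MYSQL_HOST', 'localhost'),
    'port': int(os.getenv('MYSQL_PORT', '3306')),
}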

This is my Dockerfile:

# As Scrapy runs on Python, I choose the official Python 3 Docker image.
FROM python:3.7.3-stretch
 
# Set the working directory to /scraper/src/docker.
WORKDIR /scraper/src/docker
 
# Copy the file from the local host to the filesystem of the container at the working directory.
COPY requirements.txt ./
 
# Install the dependencies (including Scrapy) specified in requirements.txt.
RUN pip3 install --no-cache-dir -r requirements.txt
 
# Copy the project source code from the local host to the filesystem of the container at the working directory.
COPY . .
 
# Run the crawler when the container launches.
CMD [ "python3", "./scraper/scraper/main.py" ]
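
An optional Dockerfile addition that makes this easier to debug (my sketch, not required for correctness): PYTHONUNBUFFERED=1 forces Python to flush stdout/stderr immediately, so log lines appear in docker logs without buffering delays.

# Flush Python output straight to the container log.
ENV PYTHONUNBUFFERED=1

Paired with the os.getenv sketch above, the DB host can then be overridden at run time:

docker run --network=host -e MYSQL_HOST=127.0.0.1 mycrawler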

The structure of my project is as follows:

proj
├── scraper
│   └── scraper
│       ├── spiders
│       │   ├── ...
│       │   └── ...
│       ├── main.py
│       └── ...
├── Dockerfile
└── requirements.txt
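
Given this layout and the COPY . . above, the script should land at /scraper/src/docker/scraper/scraper/main.py inside the image, which is what the CMD path assumes. A quick sanity check that overrides CMD with an ad-hoc command:

docker run --rm mycrawler ls scraper/scraper

If main.py and spiders/ are not listed, the CMD path (or the build context) is the first thing to fix.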

PROBLEM

When I run python main.py locally, it works fine: I can see the scraper running in the terminal, and the DB gets successfully populated. However, when I build the Docker image with docker build -t mycrawler . and run it with docker run --network=host mycrawler, all I see is this output:

2020-11-08 13:13:48 [scrapy.crawler] INFO: Overridden settings:
{}
2020-11-08 13:13:48 [scrapy.extensions.telnet] INFO: Telnet Password: 01b06b3e6f172d1d
2020-11-08 13:13:48 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.memusage.MemoryUsage',
 'scrapy.extensions.logstats.LogStats']

... and it stays like this forever, without ever writing anything to the DB, of course.

I am really new to Docker. Am I missing something in the Dockerfile, or in the way I run the built image?

  • `'host': 'localhost',` should be made configurable, such as `'host': os.getenv('MYSQL_HOST', 'localhost'),` because in docker, much like in a virtual machine, "localhost" means _that container_ and not your development machine or the virtual machine in which docker-machine is running (although I can't say offhand why it would _hang_ and not just error out; maybe more `print` in your `settings.py` would help track down where it's hanging) – mdaniel Nov 08 '20 at 20:53
  • I tried that and it didn't help. Thanks for the explanation about localhost being the one in the VM. I tried to print something in my settings.py, but it doesn't reach the print statement. – giulio di zio Nov 08 '20 at 21:08
  • I'm facing a similar problem. Have you found any solution? – Sandro Wiggers Apr 09 '22 at 23:19
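
Following up on the print suggestion in the first comment: a flushed print at the very top of settings.py (a debugging sketch, not part of the project) rules out output buffering as the reason nothing shows up:

# First lines of settings.py, before anything else runs.
import sys
print("settings.py: import started", file=sys.stderr, flush=True)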

0 Answers