
I run a Python web scraper that collects articles from various websites and saves them as CSV files. I have been running it manually, but recently I have been trying to run it in Google Cloud Shell. I had some trouble with the dependencies, so I decided to build a Docker image to run the scraper.

So far, I have managed to create a Dockerfile that builds an image with all the necessary dependencies.

FROM python:3
# Set the working directory to /app
WORKDIR /app

# Copy the current directory contents into the container at /app
ADD . /app

# Install any needed packages specified in requirements.txt
RUN pip install --trusted-host pypi.python.org -r requirements.txt
RUN pip install lxml
COPY Fin24 ./Fin24
COPY scraped_list.csv ./scraped_list.csv

# Run fin24.py when the container launches
CMD ["python3", "fin24.py"]

fin24.py contains my scraper. Fin24 is a text file that holds all the base URLs that my scraper crawls for article links before going into each article and extracting its content. scraped_list.csv lists everything I have scraped previously; my Python script checks it to make sure I don't scrape the same article again.

After running the above, I can see that it works: the Python script stops once every website it found has been scraped. However, I am guessing it saves the output CSV file inside the Docker container. How can I get it to save the file to the host directory from which I run Docker?

Ultimately, I want to simply upload the Dockerfile to Google Cloud Shell, run it as a cron job, and save all output inside the shell. Any help would be much appreciated.

  • Are you looking for the VOLUME instruction? https://docs.docker.com/engine/reference/builder/#volume – SiKing Dec 06 '17 at 17:23

1 Answer


You need to mount that path when you deploy the container. That takes two steps:

1. Add a volume in your Dockerfile:

WORKDIR /path/in/container
VOLUME ["/path/in/container"]

2. Run your container with the -v option:

docker run -i -t -v /path/on/host:/path/in/container:rw "image name"
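
For the file to appear on the host, the script itself has to write its CSV into the mounted directory. Below is a minimal sketch of what that could look like on the Python side. The output directory /app/output, the environment variable SCRAPER_OUTPUT_DIR, the function name save_articles, and the column names are all hypothetical and not taken from the question, so adapt them to your scraper.

```python
import csv
import os
from pathlib import Path

# Hypothetical default: a dedicated output folder inside the container.
# SCRAPER_OUTPUT_DIR is an assumed variable name, not something Docker defines.
OUTPUT_DIR = Path(os.environ.get("SCRAPER_OUTPUT_DIR", "/app/output"))


def save_articles(rows, filename="articles.csv", output_dir=OUTPUT_DIR):
    """Write scraped rows to a CSV file inside output_dir and return its path."""
    output_dir = Path(output_dir)
    # Create the directory if it does not exist yet (e.g. when running unmounted).
    output_dir.mkdir(parents=True, exist_ok=True)
    target = output_dir / filename
    with target.open("w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(["url", "title"])  # assumed columns, for illustration only
        writer.writerows(rows)
    return target
```

With something like this in place, running the image with -v "$(pwd)/output:/app/output" should leave the CSV in ./output on the host after the container exits. Mounting a separate output folder rather than /app itself is deliberate: a host directory mounted over /app would hide the files that were copied into the image there, including fin24.py.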
afsd
  • OK, great. So how exactly would that work? Would I add it to the Dockerfile pasted above? Doesn't the CMD instruction execute my Python script when I run the container, which then terminates once my scraper finishes? Will the volume then retain the output from my scraper and copy it back to my host path, i.e. /path/on/host? – matthew matthee Dec 07 '17 at 13:09
  • Hi, I am sorry, I shared the wrong answer. This config works when Kubernetes is used to manage Docker. If you are using the Docker container directly, you need to do something else. I am editing the answer for that. – afsd Dec 07 '17 at 16:31
  • Also if you are using the VOLUME command before your CMD command, the output will be written in your working directory and will be persistent on the host path. – afsd Dec 07 '17 at 16:44