I run a Python web scraper to collect articles from various websites, which I then save as CSV files. I have been running these manually, but recently I have been trying to run them in Google Cloud Shell. I had some trouble with the dependencies, so I decided to build a Docker image to run my Python scraper.
So far, I have managed to create a Dockerfile that I use to build an image with all the necessary dependencies:
FROM python:3
# Set the working directory to /app
WORKDIR /app
# Copy the current directory contents into the container at /app
ADD . /app
# Install any needed packages specified in requirements.txt
RUN pip install --trusted-host pypi.python.org -r requirements.txt
# Install lxml separately
RUN pip install lxml
# Copy the base-URL list and the record of already-scraped articles
COPY Fin24 ./Fin24
COPY scraped_list.csv ./scraped_list.csv
# Run fin24.py when the container launches
CMD ["python3", "fin24.py"]
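For reference, this is roughly how I build and run the image at the moment (fin24-scraper is just the tag I happen to use):

# Build the image from the directory containing the Dockerfile
docker build -t fin24-scraper .
# Run the scraper; I pass no volumes, so the output CSV presumably stays inside the container
docker run --rm fin24-scraper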
fin24.py contains my scraper. Fin24 is a text file that holds all the base URLs my scraper crawls for article links before going into each article and extracting its content. scraped_list.csv contains all the articles I have previously scraped, which my Python script checks so that I don't scrape the same article again.
After running the above, I can see that it works: the Python script stops once all the articles it found have been scraped. However, I am guessing that the output CSV file is being saved inside the Docker container. How can I get it to save to the directory from which I am running docker?
Ultimately I want to simply upload the Dockerfile to my Google Cloud Shell, run it there as a cron job, and save all the output inside the shell.
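For context, something like the crontab entry below is what I have in mind (the schedule and image tag are just placeholders; getting the output CSV back onto the Cloud Shell filesystem is the part I haven't figured out):

# Run the scraper once a day at 06:00 in Cloud Shell
0 6 * * * docker run --rm fin24-scraper

Any help would be much appreciated.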