
There is a folder named "data-persistent" in the running container that the code reads from and writes to, and I want to save the changes made in that folder. When I use a persistent volume, it removes/hides the data in that folder and the code gives an error. So what should my approach be?

FROM python:latest
ENV PYTHONDONTWRITEBYTECODE=1
ENV PYTHONUNBUFFERED=1
#RUN mkdir data-persistent
ADD linkedin_scrape.py .
COPY requirements.txt ./requirements.txt
COPY final_links.csv ./final_links.csv
COPY credentials.txt ./credentials.txt
COPY vectorizer.pk ./vectorizer.pk
COPY model_IvE ./model_IvE
COPY model_JvP ./model_JvP
COPY model_NvS ./model_NvS
COPY model_TvF ./model_TvF
COPY nocopy.xlsx ./nocopy.xlsx
COPY data.db /data-persistent/
COPY textdata.txt /data-persistent/
RUN ls -la /data-persistent/*
RUN pip install -r requirements.txt
CMD python linkedin_scrape.py --bind 0.0.0.0:8080 --timeout 90

And my deployment file

apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-first-cluster1
spec:
  replicas: 2
  selector:
    matchLabels:
      app: scrape
  template:
    metadata:
      labels:
        app: scrape
    spec:
      containers:
      - name: scraper
        image: image-name
        ports:
        - containerPort: 8080
        env:
        - name: PORT
          value: "8080"

        volumeMounts:
        - mountPath: "/dev/shm"
          name: dshm
        - mountPath: "/data-persistent/"
          name: tester
      volumes:
      - name: dshm
        emptyDir:
          medium: Memory
      - name: tester
        persistentVolumeClaim:
          claimName: my-pvc-claim-1

Let me explain the workflow of the code. The code reads from the textdata.txt file, which contains the indices of the links to be scraped, e.g. from 100 to 150. It then scrapes those profiles, inserts them into the data.db file, and finally writes the range to be scraped in the next run, e.g. 150 to 200, back to the textdata.txt file.
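Concretely, the loop described above might look like the following sketch. The file names follow the question; the scraping itself is replaced by a placeholder, and `run_once`/`read_window`/`write_window` are assumed names, not the actual code.

```python
import os
import sqlite3

DATA_DIR = "data-persistent"  # the folder the question mounts a volume over

def read_window(path):
    # textdata.txt holds the index range for this run, e.g. "100 150"
    with open(path) as f:
        start, end = map(int, f.read().split())
    return start, end

def write_window(path, start, end):
    # record the range the *next* run should scrape
    with open(path, "w") as f:
        f.write(f"{start} {end}")

def run_once(batch=50):
    txt = os.path.join(DATA_DIR, "textdata.txt")
    start, end = read_window(txt)
    db = sqlite3.connect(os.path.join(DATA_DIR, "data.db"))
    db.execute(
        "CREATE TABLE IF NOT EXISTS profiles (idx INTEGER PRIMARY KEY, data TEXT)"
    )
    for i in range(start, end):
        # placeholder for scraping link number i and storing the result
        db.execute("INSERT OR REPLACE INTO profiles VALUES (?, ?)",
                   (i, f"profile-{i}"))
    db.commit()
    db.close()
    write_window(txt, end, end + batch)  # e.g. "100 150" becomes "150 200"
```

Because both data.db and textdata.txt live in data-persistent/, every run mutates that folder, which is why the empty mounted volume breaks the first run.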

  • The best approach is to store the data somewhere else, like a database; that will let you scale up the deployment to as much compute as you need without worrying about where "the files" are. If you can't do that, you should use a StatefulSet and not a Deployment, and be aware that each replica will have a different copy of "the files" and that you will be in the initial state you describe, where the mounted volume starts empty. – David Maze Jan 28 '22 at 14:16
  • If you're thinking of Docker's feature that copies data from the image into a named volume, I recommend avoiding it. It doesn't work on Docker bind mounts or in Kubernetes, and if you update the original content in the image, the volume never gets updated. You can prototype your setup using a Docker bind mount, mounting an empty host directory as the data store, and you'll see the same behavior you see in Kubernetes. – David Maze Jan 28 '22 at 14:25
  • Thanks for the response, will the StatefulSet method ensure that the data updates with each run? – Sardar Arslan Jan 28 '22 at 14:40
  • And can you kindly write the solution using stateful set? – Sardar Arslan Jan 28 '22 at 14:45
  • No, the StatefulSet will create a new empty volume for each replica and mount it in the indicated location, hiding whatever was in the image. The cluster will never copy data into that volume; your image needs to do it itself. – David Maze Jan 28 '22 at 14:58
  • Thanks I got it. – Sardar Arslan Jan 28 '22 at 15:05

1 Answer


First, the Kubernetes volume mount point hides the original filesystem content at /data-persistent/.

To solve such a case, you have several options.

Solution 1

  • edit your Dockerfile to copy the local data to /tmp-data-persistent
  • then add an "init container" that copies the content of /tmp-data-persistent to /data-persistent; that will copy the data into the volume and make it persistent
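As a sketch of Solution 1 (the names seed-data and tmp-data-persistent are illustrative, and image-name stands in for the real image as in the question), the Dockerfile would stage the seed files in a directory that is not shadowed by the mount, and an init container would copy them into the volume before the main container starts:

```yaml
# In the Dockerfile, stage the files outside the mount point, e.g.:
#   COPY data.db textdata.txt /tmp-data-persistent/
# Then in the Deployment's pod template:
spec:
  initContainers:
  - name: seed-data
    image: image-name            # same image as the main container
    # cp -n: only copy files that don't exist yet, so data already
    # written to the volume survives pod restarts
    command: ["sh", "-c", "cp -n /tmp-data-persistent/* /data-persistent/"]
    volumeMounts:
    - mountPath: /data-persistent
      name: tester
  containers:
  - name: scraper
    image: image-name
    volumeMounts:
    - mountPath: /data-persistent
      name: tester
```

The `-n` (no-clobber) flag matters: without it, every restart would reset data.db and textdata.txt to the versions baked into the image.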

Solution 2

  • it's not a good idea to bake data into Docker images; it increases image size and couples the code and data change pipelines

  • It's better to keep the data in shared storage like "s3", and let the "init container" compare and sync the data

If a cloud service like S3 is not available:

  • you can use a persistent volume type that supports multiple read/write mounts

  • attach the same volume to another deployment (using the busybox image, for example) and do the copy with "kubectl cp"

  • scale the temporary deployment to zero after finalizing the copy; you can also make this part of a CI pipeline
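The copy step above might look like the following commands. This is a sketch only: busybox-copy is an assumed helper-pod name, the PVC name follows the question, and the volume type must allow the extra mount (e.g. access mode ReadWriteMany).

```shell
# run a throwaway pod that mounts the same PVC
kubectl run busybox-copy --image=busybox --restart=Never \
  --overrides='{"spec":{"containers":[{"name":"busybox-copy","image":"busybox","command":["sleep","3600"],"volumeMounts":[{"mountPath":"/data-persistent","name":"tester"}]}],"volumes":[{"name":"tester","persistentVolumeClaim":{"claimName":"my-pvc-claim-1"}}]}}'

# copy the seed files into the volume
kubectl cp data.db busybox-copy:/data-persistent/data.db
kubectl cp textdata.txt busybox-copy:/data-persistent/textdata.txt

# remove the helper once the copy is done (the answer's variant keeps a
# temporary Deployment instead and scales it to zero)
kubectl delete pod busybox-copy
```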

Tamer Elfeky
  • Thanks for the response, can you kindly elaborate on solution 1? The problem is when the files will be copied: right after starting the container, or before the process ends? – Sardar Arslan Jan 28 '22 at 14:43
  • check https://kubernetes.io/docs/concepts/workloads/pods/init-containers/ for the init-container idea. It copies the files before the main process starts; call it a "pre-boot" stage – Tamer Elfeky Jan 28 '22 at 14:48
  • Thanks, but I need a solution where the files are copied after the code writes to them, i.e. before the process ends. If it copies before the process starts, nothing will be updated, because the changes happen at the end of the code. The code scrapes profiles and inserts them into the db after scraping, and the other file is the index of links to be scraped, which needs to be updated on each run; otherwise it scrapes the same profiles over and over. – Sardar Arslan Jan 28 '22 at 14:53
  • you can stretch the CMD to include a cp after the python call ends, like `python ..... && cp file /directory`. One more question: what is the purpose of "COPY data.db /data-persistent/" in the Dockerfile? – Tamer Elfeky Jan 28 '22 at 15:04
  • I wanted to mount a volume, but everything was in the root directory, and when I mounted the volume there it gave an error, since the volume mount overwrites everything; so I put the files in the data-persistent directory instead. I have edited my question and described how the code works. I will try the cmd command. – Sardar Arslan Jan 28 '22 at 15:08
  • I'd also advise switching the "Deployment" to a "CronJob", as it better matches the short-lived scraping process – Tamer Elfeky Jan 28 '22 at 15:12
  • I will do that next IA, once i sort the memory issues. – Sardar Arslan Jan 28 '22 at 15:13
  • @SardarArslan , kindly vote for the answer if found it useful – Tamer Elfeky Feb 01 '22 at 20:20
  • Done!!!!!!!!!!! – Sardar Arslan Feb 03 '22 at 01:09