
I'm using Airflow with kubernetes executor and the KubernetesPodOperator. I have two jobs:

  • A: Retrieve data from some source up to 100MB
  • B: Analyze the data from A.

To share data between the jobs, I would like to run them on the same pod: A would write the data to a volume, and B would read the data from that volume.

The documentation states:

The Kubernetes executor will create a new pod for every task instance.

Is there any way to achieve this? And if not, what is the recommended way to pass the data between the jobs?

matanper

4 Answers


Sorry, this isn't possible: it's one task per pod.

Your best bet is to have task 1 put the data in a well-known location (e.g. a cloud bucket) and have the second task fetch it from there, as in the sketch below. Or just combine the two tasks.
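A minimal sketch of that pattern, assuming a hypothetical GCS bucket my-handoff-bucket and hypothetical images fetcher:latest / analyzer:latest whose entrypoints handle the upload/download themselves (import path is the cncf.kubernetes provider one; on Airflow 1.10 the operator lives under airflow.contrib):

# Sketch: two KubernetesPodOperator tasks handing data off through a cloud bucket.
# Bucket name, images, commands and paths are placeholders.
from datetime import datetime

from airflow import DAG
from airflow.providers.cncf.kubernetes.operators.kubernetes_pod import KubernetesPodOperator

with DAG("handoff_via_bucket", start_date=datetime(2019, 6, 1), schedule_interval=None) as dag:
    # Task A: fetch the source data and upload it to the bucket.
    retrieve = KubernetesPodOperator(
        task_id="retrieve_data",
        name="retrieve-data",
        image="fetcher:latest",
        cmds=["sh", "-c"],
        arguments=["fetch-data /tmp/data.csv && gsutil cp /tmp/data.csv gs://my-handoff-bucket/{{ ds }}/data.csv"],
    )

    # Task B: download the file written by task A and analyze it.
    analyze = KubernetesPodOperator(
        task_id="analyze_data",
        name="analyze-data",
        image="analyzer:latest",
        cmds=["sh", "-c"],
        arguments=["gsutil cp gs://my-handoff-bucket/{{ ds }}/data.csv /tmp/data.csv && analyze /tmp/data.csv"],
    )

    retrieve >> analyze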

eamon1234
  • Thanks. The initial idea was to make an abstraction so the first job (of which I have multiple implementations) wouldn't need to be aware of any cloud storage. Is this an incorrect approach? – matanper Jun 11 '19 at 19:20
  • Airflow is very flexible, so it's whatever works for you. But the approach I've seen is: if you want to pass data between tasks, best practice is to store it somewhere outside Airflow. Task 2 doesn't necessarily need to be aware of the storage; you can pass the task_id and date (or whatever) into a function that downloads the file from wherever it lives and puts it somewhere local so that task 2 can do its thing. – eamon1234 Jun 12 '19 at 11:56
  • task2 initiates a new pod with a specific image; how can I have a process that downloads the data before the image executes? – matanper Jun 12 '19 at 20:10

You can absolutely accomplish this using subdags and the SubDagOperator. When you start a subdag, the Kubernetes executor creates one pod at the subdag level, and all subtasks run on that pod.

This behavior does not seem to be documented. We just discovered this recently when troubleshooting a process.
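For reference, a minimal sketch of that layout (1.10-era import paths; commands and names are placeholders; the one-pod-per-subdag behaviour is what the answer above describes, not something the docs promise):

# Sketch: a parent DAG whose SubDagOperator wraps the retrieve/analyze tasks.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash_operator import BashOperator
from airflow.operators.subdag_operator import SubDagOperator

DEFAULT_ARGS = {"start_date": datetime(2019, 6, 1)}

def build_subdag(parent_dag_id, child_task_id):
    # Both tasks read/write the same local path; per the answer above, with the
    # Kubernetes executor they end up on the same pod.
    subdag = DAG(
        dag_id="{}.{}".format(parent_dag_id, child_task_id),
        default_args=DEFAULT_ARGS,
        schedule_interval=None,
    )
    retrieve = BashOperator(task_id="retrieve_data", bash_command="fetch-data /tmp/data.csv", dag=subdag)
    analyze = BashOperator(task_id="analyze_data", bash_command="analyze /tmp/data.csv", dag=subdag)
    retrieve >> analyze
    return subdag

with DAG("pipeline", default_args=DEFAULT_ARGS, schedule_interval=None) as dag:
    shared_pod_section = SubDagOperator(
        task_id="shared_pod_section",
        subdag=build_subdag("pipeline", "shared_pod_section"),
    )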

trejas
  • Interesting. The question was regarding `PodOperator` as distinct from the k8s executor pod. Does this method also work for the `PodOperator`? – eamon1234 Jun 12 '19 at 14:10
  • Apologies, I missed the part about the PodOperator. Not sure if it would work; gut feeling is this is a bit too “Inception”. – trejas Jun 12 '19 at 14:11

Yes, you can do that using init containers inside the job: within the same pod, the main container will not start before the init containers complete their tasks.

apiVersion: v1
kind: Pod
metadata:
  name: myapp-pod
  labels:
    app: myapp
spec:
  containers:
  - name: myapp-container
    image: busybox:1.28
    command: ['sh', '-c', 'echo The app is running! && sleep 3600']
  initContainers:
  - name: init-myservice
    image: busybox:1.28
    command: ['sh', '-c', 'until nslookup myservice; do echo waiting for myservice; sleep 2; done;']
  - name: init-mydb
    image: busybox:1.28
    command: ['sh', '-c', 'until nslookup mydb; do echo waiting for mydb; sleep 2; done;']

This is an example for a Pod; you can apply the same pattern to kind: Job.
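If you drive this from Airflow rather than a raw manifest, newer versions of KubernetesPodOperator take an init_containers argument. A rough sketch, assuming a cncf.kubernetes provider release whose init_containers/volumes/volume_mounts arguments accept kubernetes.client models (image, download URL and paths are placeholders):

# Sketch: an init container downloads the input into an emptyDir volume that the
# main container then reads. Names, images and the URL are placeholders.
from datetime import datetime

from kubernetes.client import models as k8s

from airflow import DAG
from airflow.providers.cncf.kubernetes.operators.kubernetes_pod import KubernetesPodOperator

shared_volume = k8s.V1Volume(name="shared-data", empty_dir=k8s.V1EmptyDirVolumeSource())
shared_mount = k8s.V1VolumeMount(name="shared-data", mount_path="/data")

init_download = k8s.V1Container(
    name="init-download",
    image="busybox:1.28",
    command=["sh", "-c", "wget -O /data/input.csv http://example.com/input.csv"],
    volume_mounts=[shared_mount],
)

with DAG("init_container_example", start_date=datetime(2019, 6, 1), schedule_interval=None) as dag:
    analyze = KubernetesPodOperator(
        task_id="analyze_data",
        name="analyze-data",
        image="analyzer:latest",
        cmds=["sh", "-c", "analyze /data/input.csv"],
        init_containers=[init_download],  # must finish before the main container starts
        volumes=[shared_volume],
        volume_mounts=[shared_mount],
    )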

Semah Mhamdi

You can have two separate tasks, A and B, where data is handed off from A to B. Kubernetes has out-of-the-box support for this kind of volume, e.g. https://kubernetes.io/docs/concepts/storage/volumes/#awselasticblockstore. Data generated by one pod is persisted, so it isn't lost when the pod is deleted; the same volume can be mounted by another pod, which can then access the data.
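A rough sketch of that with two KubernetesPodOperator tasks mounting the same pre-created PersistentVolumeClaim (claim name, images, commands and paths are placeholders; assumes a cncf.kubernetes provider release whose volumes/volume_mounts arguments accept kubernetes.client models):

# Sketch: tasks A and B mount the same PVC; A writes the file, B reads it.
from datetime import datetime

from kubernetes.client import models as k8s

from airflow import DAG
from airflow.providers.cncf.kubernetes.operators.kubernetes_pod import KubernetesPodOperator

# Volume backed by an existing PersistentVolumeClaim (e.g. provisioned on EBS).
shared_volume = k8s.V1Volume(
    name="handoff",
    persistent_volume_claim=k8s.V1PersistentVolumeClaimVolumeSource(claim_name="handoff-pvc"),
)
shared_mount = k8s.V1VolumeMount(name="handoff", mount_path="/mnt/handoff")

with DAG("handoff_via_pvc", start_date=datetime(2019, 6, 1), schedule_interval=None) as dag:
    retrieve = KubernetesPodOperator(
        task_id="retrieve_data",
        name="retrieve-data",
        image="fetcher:latest",
        cmds=["sh", "-c", "fetch-data /mnt/handoff/data.csv"],
        volumes=[shared_volume],
        volume_mounts=[shared_mount],
    )
    analyze = KubernetesPodOperator(
        task_id="analyze_data",
        name="analyze-data",
        image="analyzer:latest",
        cmds=["sh", "-c", "analyze /mnt/handoff/data.csv"],
        volumes=[shared_volume],
        volume_mounts=[shared_mount],
    )
    retrieve >> analyze

Keep in mind that block-storage volumes such as EBS are typically ReadWriteOnce, so the two pods shouldn't need the volume at the same time; the sequential A >> B dependency keeps that workable as long as both pods can attach the volume (e.g. same availability zone).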

sdvd