
At the moment we run a self-managed Airflow installation on Kubernetes, but the plan is to migrate to Cloud Composer. We use Airflow to run Dataflow jobs through a customized version of DataFlowJavaOperator (installed as a plugin), because we need to execute a Java application that isn't self-contained in a single jar. So we basically run a bash script that launches the command:

java -cp jar_folder/* MainClass

All of the jar dependencies are stored on a disk shared between all the workers, but this feature is missing in Composer, where we are forced to use Cloud Storage to share job binaries. The problem is that running a Java program from a directory backed by GCS through gcsfuse is extremely slow.

Do you have any suggestions for implementing this scenario in Cloud Composer?

Thanks

stesua

1 Answer


Composer automatically syncs content placed in gs://{your-bucket}/dags and gs://{your-bucket}/plugins to the local Pod file system. We expect only DAG and plugin source code to be copied there, but we don't prevent anyone from storing other binaries. This is not recommended, though: you may exceed the disk capacity, at which point workflow execution would be affected by insufficient local space.

FYI, the local file system paths are /home/airflow/gcs/dags and /home/airflow/gcs/plugins, respectively.

hexacyanide
Feng Lu
  • Thanks, but I tried your solution and found that files written to those directories are cleaned up periodically, so only .py files seem to be kept. Furthermore, those folders aren't shared among workers. We could store the files directly on Cloud Storage (for example under /home/airflow/gcsfuse/data), which is shared, but it is too slow for our purposes. – stesua Jun 12 '18 at 16:19
  • You are right, the non-.py files are automatically purged from the dags/ and plugins/ folders. Have you tried passing the GCS object path of the Dataflow jar to the [dataflow operator](https://github.com/apache/incubator-airflow/blob/master/airflow/contrib/operators/dataflow_operator.py#L77)? The current implementation downloads the Dataflow jar from GCS into a local /tmp folder. – Feng Lu Jul 18 '18 at 05:50
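
For a folder of jars rather than a single jar, the same "download to local disk first" idea could be applied by hand inside the custom operator or script. The following is only a rough sketch, not what the stock operator does: the helper names (build_stage_command, build_java_command, run_job) are hypothetical, and it assumes that gsutil and java are on the worker's PATH and that the worker has enough temporary disk space for the jars.

```python
import os
import subprocess
import tempfile


def build_stage_command(gcs_jar_prefix, local_dir):
    """Return the gsutil command that mirrors a GCS prefix into local_dir."""
    # Assumption: gsutil is installed on the worker running the task.
    return ["gsutil", "-m", "rsync", "-r", gcs_jar_prefix, local_dir]


def build_java_command(local_dir, main_class, args=()):
    """Return the java command with a wildcard classpath over local_dir."""
    # The command is passed as a list (no shell), so the JVM itself
    # expands "dir/*" to every jar file found in that directory.
    classpath = os.path.join(local_dir, "*")
    return ["java", "-cp", classpath, main_class] + list(args)


def run_job(gcs_jar_prefix, main_class, args=()):
    """Copy the jars to local disk once, then launch the Java main class."""
    local_dir = tempfile.mkdtemp(prefix="dataflow_jars_")
    subprocess.check_call(build_stage_command(gcs_jar_prefix, local_dir))
    subprocess.check_call(build_java_command(local_dir, main_class, args))
```

Calling something like run_job("gs://{your-bucket}/jar_folder", "MainClass") from the task would then read the jars from local disk instead of going through gcsfuse on every class load. The copy is repeated for each task run, so whether this is faster overall depends on the total size of the jars.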