
I'm trying to execute the DAG below. It seems that the operator creating the Dataproc cluster does not enable the optional components needed for Jupyter notebook and Anaconda. I found this code here: Component Gateway with DataprocOperator on Airflow, and tried it, but it didn't solve the problem for me; I think the Composer (Airflow) version there is different. My version is composer-2.0.0-preview.5 with airflow-2.1.4.

The operator works perfectly when creating the cluster, but it doesn't create it with the optional components that enable the Jupyter notebook. Does anyone have any ideas to help me?

from airflow.contrib.sensors.gcs_sensor import GoogleCloudStoragePrefixSensor
from airflow import DAG
from datetime import datetime, timedelta
from airflow.contrib.operators.dataproc_operator import DataprocClusterCreateOperator,DataprocClusterDeleteOperator, DataProcSparkOperator
from airflow.contrib.operators.gcs_to_bq import GoogleCloudStorageToBigQueryOperator

yesterday = datetime.combine(datetime.today() - timedelta(1),
                             datetime.min.time())


default_args = {
    'owner': 'teste3',
    'depends_on_past': False,
    'start_date' :yesterday,
    'email': ['airflow@example.com'],
    'email_on_failure': False,
    'email_on_retry': False,
    'retries': 0,
    'retry_delay': timedelta(minutes=5),

}

dag = DAG(
    'teste-dag-3', catchup=False, default_args=default_args, schedule_interval=None)


# configures the optional components
class CustomDataprocClusterCreateOperator(DataprocClusterCreateOperator):

    def __init__(self, *args, **kwargs):
        super(CustomDataprocClusterCreateOperator, self).__init__(*args, **kwargs)

    def _build_cluster_data(self):
        cluster_data = super(CustomDataprocClusterCreateOperator, self)._build_cluster_data()
        cluster_data['config']['endpointConfig'] = {
            'enableHttpPortAccess': True
        }
        cluster_data['config']['softwareConfig']['optionalComponents'] = [ 'JUPYTER', 'ANACONDA' ]
        return cluster_data


create_cluster = CustomDataprocClusterCreateOperator(
        dag=dag,
        task_id='start_cluster_example',
        cluster_name='teste-ge-{{ ds }}',
        project_id="sandbox-coe",
        num_workers=2,
        num_masters=1,
        master_machine_type='n2-standard-8',
        worker_machine_type='n2-standard-8',
        worker_disk_size=500,
        master_disk_size=500,
        master_disk_type='pd-ssd',
        worker_disk_type='pd-ssd',
        image_version='1.5.56-ubuntu18',
        tags=['allow-dataproc-internal'],
        region="us-central1",
        zone='us-central1-f',#Variable.get('gc_zone'),
        storage_bucket="bucket-dataproc-ge",
        labels={'product': 'sample-label'},
        service_account_scopes=['https://www.googleapis.com/auth/cloud-platform'],
        #properties={"yarn:yarn.nodemanager.resource.memory-mb" : 15360,"yarn:yarn.scheduler.maximum-allocation-mb" : 15360},
        #subnetwork_uri="projects/project-id/regions/us-central1/subnetworks/dataproc-subnet",
        retries= 1,
        retry_delay=timedelta(minutes=1)
    ) #starts a dataproc cluster


stop_cluster_example = DataprocClusterDeleteOperator(
    dag=dag,
    task_id='stop_cluster_example',
    cluster_name='teste-ge-{{ ds }}',
    project_id="sandbox-coe",
    region="us-central1",
    ) #stops a running dataproc cluster




create_cluster >> stop_cluster_example

1 Answer


Edit: After taking a deeper look, you don't need a custom operator any more. The ClusterGenerator helper used with the updated DataprocCreateClusterOperator has enable_component_gateway and optional_components parameters, so you can set them directly:

from airflow.providers.google.cloud.operators.dataproc import ClusterGenerator, DataprocCreateClusterOperator

CLUSTER_GENERATOR = ClusterGenerator(
    project_id=PROJECT_ID,
    region=REGION,
    ...,
    enable_component_gateway=True,
    optional_components=['JUPYTER', 'ANACONDA'],
).make()

DataprocCreateClusterOperator(
    ...,
    cluster_config=CLUSTER_GENERATOR
)

You can check this example dag for more details. You can view all possible parameters of ClusterGenerator in the source code.
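For reference, here is a sketch of how the question's create_cluster task might look rewritten this way. It reuses the project, region, bucket, and machine settings from the question; the exact parameter set accepted by ClusterGenerator depends on your google provider version, so treat this as an illustrative assumption rather than a drop-in replacement:

from datetime import timedelta

from airflow.providers.google.cloud.operators.dataproc import (
    ClusterGenerator,
    DataprocCreateClusterOperator,
)

# Build the cluster_config dict from the same settings the question passed
# to the old contrib operator.
CLUSTER_GENERATOR_CONFIG = ClusterGenerator(
    project_id='sandbox-coe',
    region='us-central1',
    zone='us-central1-f',
    num_masters=1,
    num_workers=2,
    master_machine_type='n2-standard-8',
    worker_machine_type='n2-standard-8',
    master_disk_type='pd-ssd',
    master_disk_size=500,
    worker_disk_type='pd-ssd',
    worker_disk_size=500,
    image_version='1.5.56-ubuntu18',
    storage_bucket='bucket-dataproc-ge',
    tags=['allow-dataproc-internal'],
    service_account_scopes=['https://www.googleapis.com/auth/cloud-platform'],
    optional_components=['JUPYTER', 'ANACONDA'],
    enable_component_gateway=True,  # needs a provider/Composer version with the fix discussed below
).make()

create_cluster = DataprocCreateClusterOperator(
    dag=dag,
    task_id='start_cluster_example',
    cluster_name='teste-ge-{{ ds }}',
    project_id='sandbox-coe',
    region='us-central1',
    labels={'product': 'sample-label'},
    cluster_config=CLUSTER_GENERATOR_CONFIG,
    retries=1,
    retry_delay=timedelta(minutes=1),
)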

Original Answer: The operator was re-written (see PR). I think the issue is with your _build_cluster_data function.

You probably should change your code to:

def _build_cluster_data(self):
    cluster_data = super(CustomDataprocClusterCreateOperator, self)._build_cluster_data()
    cluster_data['config']['endpoint_config'] = {
        'enableHttpPortAccess': True
    }
    cluster_data['config']['software_config']['optional_components'] = ['JUPYTER', 'ANACONDA']  # redundant, see note 2 below
    return cluster_data

A few notes:

  1. DataprocClusterCreateOperator (which your CustomDataprocClusterCreateOperator extends) is deprecated. You should use DataprocCreateClusterOperator from the google provider.

  2. You don't need to set cluster_data['config']['software_config']['optional_components'] by hand; you can set the value directly by passing optional_components to ClusterGenerator (see the source code), as sketched below.
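For illustration, here is a minimal sketch of the fragment of cluster config that those two settings end up producing. The snake_case field names are an assumption based on the Dataproc proto fields, not copied from the provider source:

# Illustrative only: assumed fragment of ClusterGenerator(...).make() output
# when optional_components and enable_component_gateway are set.
cluster_config_fragment = {
    'software_config': {
        'optional_components': ['JUPYTER', 'ANACONDA'],  # set via optional_components
    },
    'endpoint_config': {
        'enable_http_port_access': True,  # set via enable_component_gateway
    },
}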

Elad Kalif
  • @ThiagoPositeliDeArruda Did it solve your issue? If so, please accept the answer – Elad Kalif Feb 22 '22 at 16:23
  • Hi @Elad, thanks for the help. I passed it exactly like this just now, but I got this error in Airflow: File "/opt/python3.8/lib/python3.8/site-packages/airflow/models/baseoperator.py", line 178, in apply_defaults result = func(self, *args, **kwargs) File "/opt/python3.8/lib/python3.8/site-packages/airflow/models/baseoperator.py", line 506, in __init__ raise AirflowException( airflow.exceptions.AirflowException: Invalid arguments were passed to DataprocCreateClusterOperator (task_id: start_cluster_example). Invalid arguments were: **kwargs: {'enable_component_gateway': True} – Thiago Positeli De Arruda Feb 22 '22 at 16:27
  • I just set it in the operator directly – Thiago Positeli De Arruda Feb 22 '22 at 16:28
  • The same error with the updated answer: File "/opt/python3.8/lib/python3.8/site-packages/airflow/models/baseoperator.py", line 178, in apply_defaults result = func(self, *args, **kwargs) File "/opt/python3.8/lib/python3.8/site-packages/airflow/models/baseoperator.py", line 506, in __init__ raise AirflowException( airflow.exceptions.AirflowException: Invalid arguments were passed to DataprocCreateClusterOperator (task_id: start_cluster_example). Invalid arguments were: **kwargs: {'optional_components': ['JUPYTER', 'ANACONDA']} – Thiago Positeli De Arruda Feb 22 '22 at 17:42
  • When I remove optional_components = ['JUPYTER', 'ANACONDA'] from the operator, the DAG runs perfectly with enable_component_gateway=True, but when the cluster is created the web interface (Component Gateway) is disabled – Thiago Positeli De Arruda Feb 22 '22 at 17:46
  • @ThiagoPositeliDeArruda sorry, both are parameters of ClusterGenerator; check now – Elad Kalif Feb 22 '22 at 17:51
  • Some parts work like a charm: the cluster enables optional_components = ['JUPYTER', 'ANACONDA'], but the web interface to access the Jupyter notebook is disabled. Do you know why? I set enable_component_gateway=True in the cluster config too. – Thiago Positeli De Arruda Feb 22 '22 at 18:07
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/242280/discussion-between-thiago-positeli-de-arruda-and-elad). – Thiago Positeli De Arruda Feb 22 '22 at 18:10
  • If this wasn't resolved, then either there is a problem on your side or a bug in the operator. If it's a bug, please open a GitHub issue: https://github.com/apache/airflow/issues – Elad Kalif Feb 22 '22 at 19:49
  • @ThiagoPositeliDeArruda, Could you please let me know if your issue is resolved or not? – Prajna Rai T Feb 23 '22 at 05:48
  • Thanks @Elad. I'm still investigating whether the problem is on my side or it's a bug, but I believe my code is correct: it creates and deletes the cluster normally, and with the latest tweaks you suggested it enabled the additional components (Jupyter and Anaconda), but it did not activate the web interface, even with the property set to True – Thiago Positeli De Arruda Feb 23 '22 at 18:24
  • @PrajnaRaiT it's not resolved completely. I put optional_components = ['JUPYTER', 'ANACONDA'] in the ClusterGenerator as in the solution above, and the web interface on the Dataproc cluster is still disabled. – Thiago Positeli De Arruda Feb 23 '22 at 18:26
  • Sorry, but I don't think there is more we can do here. It's either a bug in your settings or a bug in the open source code (Airflow or the Google Python package). Both cases are out of scope for Stack Overflow. – Elad Kalif Feb 23 '22 at 21:31
  • @ThiagoPositeliDeArruda, I have also tried to enable the Component Gateway to access web interfaces by including enable_component_gateway=True along with optional_components=['JUPYTER', 'ANACONDA'] in the code. Still, the Component Gateway was in the disabled state. Also, I found the GitHub issue you had raised. Posting the [issue link](https://github.com/apache/airflow/issues/21800) here for posterity. – Prajna Rai T Feb 28 '22 at 08:18
  • Yeah this seems to be a bug in Airflow. – Elad Kalif Feb 28 '22 at 08:59
  • Yeah, I posted that issue, @PrajnaRaiT. – Thiago Positeli De Arruda Mar 02 '22 at 16:05
  • The bug has been fixed and it now works with the updated Airflow 2.2.3 and Composer version 2.0.5. – Thiago Positeli De Arruda Mar 08 '22 at 12:31
  • If so, then this was probably a bug in Composer rather than Airflow, as we didn't release a new Airflow version... – Elad Kalif Mar 08 '22 at 12:35
  • @Elad I think you are right. Many thanks for the help and the solution you gave me. – Thiago Positeli De Arruda Mar 08 '22 at 13:30
  • If solved, then kindly accept the answer, as it is the solution now that Composer has fixed whatever bug they had on their side. – Elad Kalif Mar 08 '22 at 13:34