
I'm trying to migrate from Airflow 1.10 to Airflow 2, which renames some operators, including DataprocClusterCreateOperator (now DataprocCreateClusterOperator). Here is an extract of the code:

from airflow.providers.google.cloud.operators.dataproc import DataprocCreateClusterOperator

with dag:

    create_dataproc_cluster = DataprocCreateClusterOperator(
        task_id="create_test_dataproc_cluster",
        project_id="project_id",
        region="us-central1",
        cluster_name="cluster_name",
        tags="dataproc",
        num_workers=2,
        storage_bucket=None,
        num_masters=1,
        master_machine_type="n1-standard-4",
        master_disk_type="pd-standard",
        master_disk_size=500,
        worker_machine_type="n1-standard-4",
        worker_disk_type="pd-standard",
        worker_disk_size=500,
        properties={},
        image_version="1.5-ubuntu18",
        autoscaling_policy=None,
        idle_delete_ttl=7200,
        optional_components=['JUPYTER', 'ANACONDA'],
        metadata={"bigquery-connector-version": '1.1.1',
                  "spark-bigquery-connector-version": '0.17.2',
                  "PIP_PACKAGES" : 'oyaml datalab'},
        init_actions_uris=[
            'gs://goog-dataproc-initialization-actions-us-central1/connectors/connectors.sh',
            'gs://goog-dataproc-initialization-actions-us-central1/python/pip-install.sh',
        ],
    )
        
    create_dataproc_cluster 

I am running into the error below:

File "/opt/python3.8/lib/python3.8/site-packages/airflow/providers/google/cloud/hooks/dataproc.py", line 325, in create_cluster
    result = client.create_cluster(
  File "/opt/python3.8/lib/python3.8/site-packages/google/cloud/dataproc_v1beta2/services/cluster_controller/client.py", line 445, in create_cluster
    response = rpc(request, retry=retry, timeout=timeout, metadata=metadata,)
  File "/opt/python3.8/lib/python3.8/site-packages/google/api_core/gapic_v1/method.py", line 145, in __call__
    return wrapped_func(*args, **kwargs)
  File "/opt/python3.8/lib/python3.8/site-packages/google/api_core/timeout.py", line 102, in func_with_timeout
    return func(*args, **kwargs)
  File "/opt/python3.8/lib/python3.8/site-packages/google/api_core/grpc_helpers.py", line 67, in error_remapped_callable
    return callable_(*args, **kwargs)
  File "/opt/python3.8/lib/python3.8/site-packages/grpc/_channel.py", line 944, in __call__
    state, call, = self._blocking(request, timeout, metadata, credentials,
  File "/opt/python3.8/lib/python3.8/site-packages/grpc/_channel.py", line 926, in _blocking
    call = self._channel.segregated_call(
  File "src/python/grpcio/grpc/_cython/_cygrpc/channel.pyx.pxi", line 498, in grpc._cython.cygrpc.Channel.segregated_call
  File "src/python/grpcio/grpc/_cython/_cygrpc/channel.pyx.pxi", line 366, in grpc._cython.cygrpc._segregated_call
  File "src/python/grpcio/grpc/_cython/_cygrpc/channel.pyx.pxi", line 360, in grpc._cython.cygrpc._segregated_call
  File "src/python/grpcio/grpc/_cython/_cygrpc/channel.pyx.pxi", line 218, in grpc._cython.cygrpc._call
  File "src/python/grpcio/grpc/_cython/_cygrpc/channel.pyx.pxi", line 246, in grpc._cython.cygrpc._call
  File "src/python/grpcio/grpc/_cython/_cygrpc/channel.pyx.pxi", line 89, in grpc._cython.cygrpc._operate
  File "src/python/grpcio/grpc/_cython/_cygrpc/tag.pyx.pxi", line 64, in grpc._cython.cygrpc._BatchOperationTag.prepare
  File "src/python/grpcio/grpc/_cython/_cygrpc/operation.pyx.pxi", line 37, in grpc._cython.cygrpc.SendInitialMetadataOperation.c
  File "src/python/grpcio/grpc/_cython/_cygrpc/metadata.pyx.pxi", line 41, in grpc._cython.cygrpc._store_c_metadata
ValueError: too many values to unpack (expected 2)

On debugging, I see this issue is due to the metadata param. Does anyone have an idea what is wrong with this param, or a way to fix it?
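I can reproduce the unpack failure in isolation, without gRPC: the client iterates the metadata expecting (key, value) pairs, but iterating a dict yields only its keys, so each long key string gets unpacked into two variables and fails. A minimal sketch:

```python
metadata = {"bigquery-connector-version": "1.1.1"}

# Iterating a dict yields its keys (strings), not (key, value) tuples,
# so the tuple unpacking below tries to unpack the key string itself.
try:
    for key, value in metadata:
        pass
except ValueError as exc:
    print(exc)  # too many values to unpack (expected 2)
```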

1 Answer


It seems that in this version the type of the metadata parameter is no longer a dict. From the docs:

metadata (Sequence[Tuple[str, str]]) -- Additional metadata that is provided to the method.

Try with:

metadata = [
    ("bigquery-connector-version", '1.1.1'),
    ("spark-bigquery-connector-version", '0.17.2'),
    ("PIP_PACKAGES", 'oyaml datalab')
]
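If you already have the values in a dict, the conversion is a one-liner, since `dict.items()` yields exactly the `(key, value)` tuples the expected `Sequence[Tuple[str, str]]` type calls for:

```python
metadata_dict = {
    "bigquery-connector-version": "1.1.1",
    "spark-bigquery-connector-version": "0.17.2",
    "PIP_PACKAGES": "oyaml datalab",
}

# dict.items() yields (key, value) tuples; list() materializes the sequence.
metadata = list(metadata_dict.items())
```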

Edit

According to this issue, you'll need to generate the cluster config with ClusterGenerator and then pass it to DataprocCreateClusterOperator:

from airflow.providers.google.cloud.operators.dataproc import ClusterGenerator

CLUSTER_GENERATOR_CONFIG = ClusterGenerator(
    project_id="project_id",
    region="us-central1",
    metadata={'PIP_PACKAGES': 'yaml datalab'},
    # ....
).make()

create_cluster_operator = DataprocCreateClusterOperator(
    task_id='create_dataproc_cluster',
    cluster_name="test",
    project_id="project_id",
    region="us-central1",
    cluster_config=CLUSTER_GENERATOR_CONFIG,
)
  • True, I missed this detail of type which caused the issue. But now I see new error "ValueError: metadata was invalid: [('bigquery-connector-version', '1.1.1'), ('spark-bigquery-connector-version', '0.17.2'), ('PIP_PACKAGES', 'oyaml'), ('x-goog-api-client', 'gl-python/3.8.12 grpc/1.39.0 gax/1.31.1 gccl/airflow_v2.1.2+composer')]" I believe this is due to some of the package name which is updated – codninja0908 Dec 20 '21 at 17:06
  • As per the docs [link](https://airflow.apache.org/docs/apache-airflow-providers-google/stable/_api/airflow/providers/google/cloud/operators/dataproc/index.html#:~:text=1%20*%2060%20*%2060%2C-,metadata,-%3A%20Optional%5BSequence%5BTuple), metadata field is optional but I see if I don't provide it , it throws error. Any specific reason ? – codninja0908 Dec 21 '21 at 02:42
  • @Neha0908 hmm this is strange... what is the error you get when you don't provide this parameter? – blackbishop Dec 21 '21 at 10:44
  • This is the issue: raise _InactiveRpcError(state) grpc._channel._InactiveRpcError: <_InactiveRpcError of RPC that terminated with: status = StatusCode.INVALID_ARGUMENT details = "Compute Engine instance tag '-' must match pattern (?:[a-z](?:[-a-z0-9]{0,61}[a-z0-9])?)" debug_error_string = "{"created":"@1640085702.011053761","description":"Error received from peer ipv4:172.217.203.95:443","file":"src/core/lib/surface/call.cc","file_line":1069,"grpc_message":"Compute Engine instance tag '-' must match pattern (?:[a-z](?:[-a-z0-9]{0,61}[a-z0-9])?)","grpc_status":3}" – codninja0908 Dec 21 '21 at 11:24
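Regarding the tag error in the last comment: one plausible cause (an assumption, not confirmed from the traceback) is the `tags="dataproc"` argument in the original DAG. The `tags` parameter expects a sequence of strings, and a bare string is itself an iterable of single characters, so each character may end up submitted as a separate Compute Engine tag:

```python
# A bare string iterates character by character, not as one tag.
tags = "dataproc"
print(list(tags))  # ['d', 'a', 't', 'a', 'p', 'r', 'o', 'c']

# Wrapping it in a list keeps "dataproc" as a single tag.
tags = ["dataproc"]
print(list(tags))  # ['dataproc']
```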