
I am trying to read from a MySQL database with the apache_beam.io.jdbc module's ReadFromJdbc() (https://beam.apache.org/releases/pydoc/current/apache_beam.io.jdbc.html).

When I don't specify an expansion service, I get ValueError: Unsupported signal: 2. That is resolved by running a custom expansion service (the fix described in ReadFromKafka throws ValueError: Unsupported signal: 2), so I did that.

I start my expansion service with java -jar beam-sdks-java-io-google-cloud-platform-expansion-service-2.43.0.jar 8096 --javaClassLookupAllowlistFile='*', because beam-sdks-java-io-expansion-service-2.43.0.jar doesn't contain the external transform beam:transform:org.apache.beam:schemaio_jdbc_write:v1.

I have tried reading from a local MySQL instance with the PortableRunner (Spark) and from a GCP Cloud SQL instance with the DataflowRunner. In both cases I get the same error; only the payload differs:

RuntimeError: java.lang.RuntimeException: Failed to build transform beam:transform:org.apache.beam:schemaio_jdbc_read:v1 from spec urn: "beam:transform:org.apache.beam:schemaio_jdbc_read:v1"
payload: "\nD\n\016\n\blocation\032\002\020\a\n\f\n\006config\032\002\020\t\022$a19f3f60-6fc8-41db-b221-dcf28077e554\022\220\001\002\000\nmigrations\201\001\v\002\260\006\025com.mysql.jdbc.Driver#jdbc:mysql://localhost:3306/database\004user\021pass*select id, migration from bob1.migrations;\001"

        at org.apache.beam.sdk.expansion.service.ExpansionService$ExternalTransformRegistrarLoader$1.getTransform(ExpansionService.java:147)
        at org.apache.beam.sdk.expansion.service.ExpansionService$TransformProvider.apply(ExpansionService.java:396)
        at org.apache.beam.sdk.expansion.service.ExpansionService.expand(ExpansionService.java:516)
        at org.apache.beam.sdk.expansion.service.ExpansionService.expand(ExpansionService.java:596)
        at org.apache.beam.model.expansion.v1.ExpansionServiceGrpc$MethodHandlers.invoke(ExpansionServiceGrpc.java:220)
        at org.apache.beam.vendor.grpc.v1p48p1.io.grpc.stub.ServerCalls$UnaryServerCallHandler$UnaryServerCallListener.onHalfClose(ServerCalls.java:182)
        at org.apache.beam.vendor.grpc.v1p48p1.io.grpc.internal.ServerCallImpl$ServerStreamListenerImpl.halfClosed(ServerCallImpl.java:354)
        at org.apache.beam.vendor.grpc.v1p48p1.io.grpc.internal.ServerImpl$JumpToApplicationThreadServerStreamListener$1HalfClosed.runInContext(ServerImpl.java:866)
        at org.apache.beam.vendor.grpc.v1p48p1.io.grpc.internal.ContextRunnable.run(ContextRunnable.java:37)
        at org.apache.beam.vendor.grpc.v1p48p1.io.grpc.internal.SerializingExecutor.run(SerializingExecutor.java:133)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
        at java.lang.Thread.run(Unknown Source)
Caused by: java.lang.RuntimeException: Unable to infer configuration row from configuration proto and schema.
        at org.apache.beam.sdk.extensions.schemaio.expansion.ExternalSchemaIOTransformRegistrar.translateRow(ExternalSchemaIOTransformRegistrar.java:110)
        at org.apache.beam.sdk.extensions.schemaio.expansion.ExternalSchemaIOTransformRegistrar.access$000(ExternalSchemaIOTransformRegistrar.java:49)
        at org.apache.beam.sdk.extensions.schemaio.expansion.ExternalSchemaIOTransformRegistrar$ReaderBuilder.buildExternal(ExternalSchemaIOTransformRegistrar.java:129)      
        at org.apache.beam.sdk.extensions.schemaio.expansion.ExternalSchemaIOTransformRegistrar$ReaderBuilder.buildExternal(ExternalSchemaIOTransformRegistrar.java:115)      
        at org.apache.beam.sdk.expansion.service.ExpansionService$ExternalTransformRegistrarLoader$1.getTransform(ExpansionService.java:141)
        ... 12 more
Caused by: org.apache.beam.sdk.coders.CoderException: java.io.EOFException
        at org.apache.beam.sdk.coders.BigEndianShortCoder.decode(BigEndianShortCoder.java:56)
        at org.apache.beam.sdk.coders.BigEndianShortCoder.decode(BigEndianShortCoder.java:28)
        at org.apache.beam.sdk.coders.RowCoderGenerator$DecodeInstruction.decodeDelegate(RowCoderGenerator.java:431)
        at org.apache.beam.sdk.coders.Coder$ByteBuddy$lYYhB38b.decode(Unknown Source)
        at org.apache.beam.sdk.coders.Coder$ByteBuddy$lYYhB38b.decode(Unknown Source)
        at org.apache.beam.sdk.schemas.SchemaCoder.decode(SchemaCoder.java:129)
        at org.apache.beam.sdk.extensions.schemaio.expansion.ExternalSchemaIOTransformRegistrar.translateRow(ExternalSchemaIOTransformRegistrar.java:108)
        ... 16 more
Caused by: java.io.EOFException
        at org.apache.beam.sdk.coders.BitConverters.readBigEndianShort(BitConverters.java:55)
        at org.apache.beam.sdk.coders.BigEndianShortCoder.decode(BigEndianShortCoder.java:52)
        ... 22 more

Logs from the expansion service terminal:

Dec 28, 2022 3:39:34 PM org.apache.beam.sdk.expansion.service.ExpansionService expand
INFO: Expanding 'Read database list' with URN 'beam:transform:org.apache.beam:schemaio_jdbc_read:v1'
Dec 28, 2022 3:39:35 PM org.apache.beam.sdk.expansion.service.ExpansionService$ExternalTransformRegistrarLoader payloadToConfig
WARNING: Configuration class 'org.apache.beam.sdk.extensions.schemaio.expansion.ExternalSchemaIOTransformRegistrar$Configuration' has no schema registered. Attempting to construct with setter approach.
Dec 28, 2022 3:39:35 PM org.apache.beam.sdk.expansion.service.ExpansionService$ExternalTransformRegistrarLoader payloadToConfig
WARNING: Configuration class 'org.apache.beam.sdk.extensions.schemaio.expansion.ExternalSchemaIOTransformRegistrar$Configuration' has no schema registered. Attempting to construct with setter approach.

Pipeline for the local instance:

import typing

import apache_beam as beam
import apache_beam.coders as coders
import apache_beam.io.jdbc as jdbc
from apache_beam.options.pipeline_options import PipelineOptions

pipeline_options = {
    'runner': 'PortableRunner',
    'job_endpoint': 'localhost:8099',
    'environment_type':'LOOPBACK'
}


pipeline_options = PipelineOptions.from_dictionary(pipeline_options)

ExampleRow = typing.NamedTuple('ExampleRow',
                               [('id', int), ('migration', bytes)])
coders.registry.register_coder(ExampleRow, coders.RowCoder)


with beam.Pipeline(options=pipeline_options) as p:
    res = (
        p
        | "Read database list" >> jdbc.ReadFromJdbc(
            table_name='migrations',
            driver_class_name='com.mysql.jdbc.Driver',
            jdbc_url='jdbc:mysql://localhost:3306/database',
            username='user',
            password='pass',
            query = "select id, migration from database.migrations;",
            fetch_size=1,
            expansion_service="localhost:8096"
        )
        | "Print results" >> beam.Map(print)
    )

Pipeline for GCP Cloud SQL instance:

import apache_beam as beam
import apache_beam.io.jdbc as jdbc
import typing
import apache_beam.coders as coders

import os
from apache_beam.options.pipeline_options import PipelineOptions

pipeline_options = {
    'project': 'project-name',
    'runner': 'DataflowRunner',
    'region': 'europe-central2',
    'staging_location':"gs://temp",
    'temp_location':"gs://temp",
    'template_location':"gs://template/gcsql"
}
pipeline_options = PipelineOptions.from_dictionary(pipeline_options)

serviceAccount = r'C:\Path\To\Service\Account.json'
os.environ['GOOGLE_APPLICATION_CREDENTIALS'] = serviceAccount

ExampleRow = typing.NamedTuple('ExampleRow',
                               [('id', int), ('migration', str)])
coders.registry.register_coder(ExampleRow, coders.RowCoder)


with beam.Pipeline(options=pipeline_options) as p:
    res = (
        p
        | "Read database list" >> jdbc.ReadFromJdbc(
            table_name='migrations',
            driver_class_name='com.mysql.jdbc.Driver',
            jdbc_url='jdbc:mysql:///<DATABASE_NAME>?cloudSqlInstance=<INSTANCE_CONNECTION_NAME>&socketFactory=com.google.cloud.sql.mysql.SocketFactory&user=<MYSQL_USER_NAME>&password=<MYSQL_USER_PASSWORD>',
            username='user',
            password='pass',
            query = "select id, migration from bob1.migrations;",
            fetch_size=1,
            classpath=["com.google.cloud.sql:mysql-socket-factory-connector-j-8:1.7.2"],
            expansion_service='localhost:8096'
        )
        | "Print results" >> beam.io.WriteToText(r'gs://output/gcsql.csv')
    )
AleksF
  • Hi @AleksF, Are you running multi language beam pipelines? Could you specify in which step you're facing error, expansion service or the pipeline code? – Shipra Sarkar Jan 04 '23 at 12:04
  • @ShipraSarkar yes, I am. Well, it's hard to tell. I can start the expansion service without a problem, but when I run the pipeline and it actually expands it, I get the error messages. Does that help? – AleksF Jan 05 '23 at 13:17
  • @AleksF Were you able to resolve this issue? – JoRoot May 04 '23 at 04:59

1 Answer


The JDBC transform needs the schema-io expansion service: https://mvnrepository.com/artifact/org.apache.beam/beam-sdks-java-extensions-schemaio-expansion-service (not the google-cloud-platform expansion service).

Yi Hu
  • Agree. Specifically this jar - https://mvnrepository.com/artifact/org.apache.beam/beam-sdks-java-extensions-schemaio-expansion-service/2.43.0 – chamikara Jan 09 '23 at 17:30
  • 1
    Also, you should not have to startup an expansion service when using released Beam. Beam should automatically startup an expansion service for you. – chamikara Jan 09 '23 at 17:31