I am trying to read from a MySQL database with the apache_beam.io.jdbc module (https://beam.apache.org/releases/pydoc/current/apache_beam.io.jdbc.html), specifically ReadFromJdbc().
When I don't specify an expansion service I get ValueError: Unsupported signal: 2, the same issue described in "ReadFromKafka throws ValueError: Unsupported signal: 2", where the fix was to run a custom expansion service, so I did that.
I run my expansion service with
java -jar beam-sdks-java-io-google-cloud-platform-expansion-service-2.43.0.jar 8096 --javaClassLookupAllowlistFile='*'
because beam-sdks-java-io-expansion-service-2.43.0.jar does not provide the external transform beam:transform:org.apache.beam:schemaio_jdbc_write:v1.
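To rule out a connectivity problem between the Python SDK and the expansion service, I check that something is actually listening on the port before submitting the pipeline. This is a small stdlib helper of my own (the host and port are whatever you passed when launching the jar):

```python
import socket

def expansion_service_up(host="localhost", port=8096, timeout=2.0):
    """Return True if a TCP listener accepts connections on host:port."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

This only confirms that the port is open, not that the service exposes the transform URN the pipeline needs.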
I have tried reading from a local MySQL instance with the PortableRunner (Spark) and from a GCP Cloud SQL instance with the DataflowRunner. In both cases I get the same error; only the payload differs:
RuntimeError: java.lang.RuntimeException: Failed to build transform beam:transform:org.apache.beam:schemaio_jdbc_read:v1 from spec urn: "beam:transform:org.apache.beam:schemaio_jdbc_read:v1"
payload: "\nD\n\016\n\blocation\032\002\020\a\n\f\n\006config\032\002\020\t\022$a19f3f60-6fc8-41db-b221-dcf28077e554\022\220\001\002\000\nmigrations\201\001\v\002\260\006\025com.mysql.jdbc.Driver#jdbc:mysql://localhost:3306/database\004user\021pass*select id, migration from bob1.migrations;\001"
at org.apache.beam.sdk.expansion.service.ExpansionService$ExternalTransformRegistrarLoader$1.getTransform(ExpansionService.java:147)
at org.apache.beam.sdk.expansion.service.ExpansionService$TransformProvider.apply(ExpansionService.java:396)
at org.apache.beam.sdk.expansion.service.ExpansionService.expand(ExpansionService.java:516)
at org.apache.beam.sdk.expansion.service.ExpansionService.expand(ExpansionService.java:596)
at org.apache.beam.model.expansion.v1.ExpansionServiceGrpc$MethodHandlers.invoke(ExpansionServiceGrpc.java:220)
at org.apache.beam.vendor.grpc.v1p48p1.io.grpc.stub.ServerCalls$UnaryServerCallHandler$UnaryServerCallListener.onHalfClose(ServerCalls.java:182)
at org.apache.beam.vendor.grpc.v1p48p1.io.grpc.internal.ServerCallImpl$ServerStreamListenerImpl.halfClosed(ServerCallImpl.java:354)
at org.apache.beam.vendor.grpc.v1p48p1.io.grpc.internal.ServerImpl$JumpToApplicationThreadServerStreamListener$1HalfClosed.runInContext(ServerImpl.java:866)
at org.apache.beam.vendor.grpc.v1p48p1.io.grpc.internal.ContextRunnable.run(ContextRunnable.java:37)
at org.apache.beam.vendor.grpc.v1p48p1.io.grpc.internal.SerializingExecutor.run(SerializingExecutor.java:133)
at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
at java.lang.Thread.run(Unknown Source)
Caused by: java.lang.RuntimeException: Unable to infer configuration row from configuration proto and schema.
at org.apache.beam.sdk.extensions.schemaio.expansion.ExternalSchemaIOTransformRegistrar.translateRow(ExternalSchemaIOTransformRegistrar.java:110)
at org.apache.beam.sdk.extensions.schemaio.expansion.ExternalSchemaIOTransformRegistrar.access$000(ExternalSchemaIOTransformRegistrar.java:49)
at org.apache.beam.sdk.extensions.schemaio.expansion.ExternalSchemaIOTransformRegistrar$ReaderBuilder.buildExternal(ExternalSchemaIOTransformRegistrar.java:129)
at org.apache.beam.sdk.extensions.schemaio.expansion.ExternalSchemaIOTransformRegistrar$ReaderBuilder.buildExternal(ExternalSchemaIOTransformRegistrar.java:115)
at org.apache.beam.sdk.expansion.service.ExpansionService$ExternalTransformRegistrarLoader$1.getTransform(ExpansionService.java:141)
... 12 more
Caused by: org.apache.beam.sdk.coders.CoderException: java.io.EOFException
at org.apache.beam.sdk.coders.BigEndianShortCoder.decode(BigEndianShortCoder.java:56)
at org.apache.beam.sdk.coders.BigEndianShortCoder.decode(BigEndianShortCoder.java:28)
at org.apache.beam.sdk.coders.RowCoderGenerator$DecodeInstruction.decodeDelegate(RowCoderGenerator.java:431)
at org.apache.beam.sdk.coders.Coder$ByteBuddy$lYYhB38b.decode(Unknown Source)
at org.apache.beam.sdk.coders.Coder$ByteBuddy$lYYhB38b.decode(Unknown Source)
at org.apache.beam.sdk.schemas.SchemaCoder.decode(SchemaCoder.java:129)
at org.apache.beam.sdk.extensions.schemaio.expansion.ExternalSchemaIOTransformRegistrar.translateRow(ExternalSchemaIOTransformRegistrar.java:108)
... 16 more
Caused by: java.io.EOFException
at org.apache.beam.sdk.coders.BitConverters.readBigEndianShort(BitConverters.java:55)
at org.apache.beam.sdk.coders.BigEndianShortCoder.decode(BigEndianShortCoder.java:52)
... 22 more
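My reading of the innermost cause: the Java-side row coder tried to read a 2-byte big-endian short out of the configuration payload but hit end-of-stream first, which suggests the configuration row the Python SDK serialized does not match the schema the Java side decodes with. A quick stdlib illustration of that failure mode (this is just an analogy, not Beam's actual coder code):

```python
import struct

# Decoding a big-endian short (">h") consumes exactly two bytes.
buf = struct.pack(">h", 7)           # b'\x00\x07'
value = struct.unpack(">h", buf)[0]  # 7

# If the stream is shorter than the decoder expects -- e.g. because the
# writer and reader disagree on the row schema -- the read runs off the
# end of the buffer, the Python analogue of the EOFException raised in
# readBigEndianShort above.
try:
    struct.unpack(">h", buf[:1])
except struct.error as e:
    truncated_error = e
```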
Logs from expansion service terminal:
Dec 28, 2022 3:39:34 PM org.apache.beam.sdk.expansion.service.ExpansionService expand
INFO: Expanding 'Read database list' with URN 'beam:transform:org.apache.beam:schemaio_jdbc_read:v1'
Dec 28, 2022 3:39:35 PM org.apache.beam.sdk.expansion.service.ExpansionService$ExternalTransformRegistrarLoader payloadToConfig
WARNING: Configuration class 'org.apache.beam.sdk.extensions.schemaio.expansion.ExternalSchemaIOTransformRegistrar$Configuration' has no schema registered. Attempting to construct with setter approach.
Dec 28, 2022 3:39:35 PM org.apache.beam.sdk.expansion.service.ExpansionService$ExternalTransformRegistrarLoader payloadToConfig
WARNING: Configuration class 'org.apache.beam.sdk.extensions.schemaio.expansion.ExternalSchemaIOTransformRegistrar$Configuration' has no schema registered. Attempting to construct with setter approach.
Pipeline for the local instance:
import apache_beam as beam
import apache_beam.io.jdbc as jdbc
import typing
import apache_beam.coders as coders
from apache_beam.options.pipeline_options import PipelineOptions

pipeline_options = {
    'runner': 'PortableRunner',
    'job_endpoint': 'localhost:8099',
    'environment_type': 'LOOPBACK',
}
pipeline_options = PipelineOptions.from_dictionary(pipeline_options)

ExampleRow = typing.NamedTuple('ExampleRow',
                               [('id', int), ('migration', bytes)])
coders.registry.register_coder(ExampleRow, coders.RowCoder)

with beam.Pipeline(options=pipeline_options) as p:
    res = (
        p
        | "Read database list" >> jdbc.ReadFromJdbc(
            table_name='migrations',
            driver_class_name='com.mysql.jdbc.Driver',
            jdbc_url='jdbc:mysql://localhost:3306/database',
            username='user',
            password='pass',
            query="select id, migration from database.migrations;",
            fetch_size=1,
            expansion_service="localhost:8096",
        )
        | "Print results" >> beam.Map(print)
    )
Pipeline for the GCP Cloud SQL instance:
import apache_beam as beam
import apache_beam.io.jdbc as jdbc
import typing
import apache_beam.coders as coders
import os
from apache_beam.options.pipeline_options import PipelineOptions

pipeline_options = {
    'project': 'project-name',
    'runner': 'DataflowRunner',
    'region': 'europe-central2',
    'staging_location': "gs://temp",
    'temp_location': "gs://temp",
    'template_location': "gs://template/gcsql",
}
pipeline_options = PipelineOptions.from_dictionary(pipeline_options)

serviceAccount = r'C:\Path\To\Service\Account.json'
os.environ['GOOGLE_APPLICATION_CREDENTIALS'] = serviceAccount

ExampleRow = typing.NamedTuple('ExampleRow',
                               [('id', int), ('migration', str)])
coders.registry.register_coder(ExampleRow, coders.RowCoder)

with beam.Pipeline(options=pipeline_options) as p:
    res = (
        p
        | "Read database list" >> jdbc.ReadFromJdbc(
            table_name='migrations',
            driver_class_name='com.mysql.jdbc.Driver',
            jdbc_url='jdbc:mysql:///<DATABASE_NAME>?cloudSqlInstance=<INSTANCE_CONNECTION_NAME>&socketFactory=com.google.cloud.sql.mysql.SocketFactory&user=<MYSQL_USER_NAME>&password=<MYSQL_USER_PASSWORD>',
            username='user',
            password='pass',
            query="select id, migration from bob1.migrations;",
            fetch_size=1,
            classpath=["com.google.cloud.sql:mysql-socket-factory-connector-j-8:1.7.2"],
            expansion_service='localhost:8096',
        )
        | "Print results" >> beam.io.WriteToText(r'gs://output/gcsql.csv')
    )