This functionality isn't built into Apache Beam, but there are several options. JdbcIO uses the standard Java JDBC interface to connect to your database, so it wouldn't be too difficult to wrap the MySQL JDBC driver in your own driver that sets up an SSH tunnel before connecting. I did a quick Google search and found a project that wraps an arbitrary JDBC driver with an SSH tunnel using SSHJ: jdbc-sshj (a copy is published to Maven as com.cekrlic:jdbc-sshj:0.1.0). The project looks somewhat unmaintained, but it will do what you want. Add it to your runtime dependencies, then update your configuration to something like this (note that this example is not secure):
pipeline.apply(JdbcIO.<KV<Integer, String>>read()
    .withDataSourceConfiguration(JdbcIO.DataSourceConfiguration.create(
            // Wrapper driver from jdbc-sshj; the real MySQL URL follows the ";;;" separator.
            "com.cekrlic.jdbc.ssh.tunnel.SshJDriver",
            "jdbc:sshj://sshbastion?remote=database:3306&username=sshuser&password=sshpassword&verify_hosts=off;;;jdbc:mysql://localhost:3306/mydb")
        .withUsername("username")
        .withPassword("password"))
    .withQuery("select id,name from Person")
    .withCoder(KvCoder.of(BigEndianIntegerCoder.of(), StringUtf8Coder.of()))
    .withRowMapper(new JdbcIO.RowMapper<KV<Integer, String>>() {
      @Override
      public KV<Integer, String> mapRow(ResultSet resultSet) throws Exception {
        return KV.of(resultSet.getInt(1), resultSet.getString(2));
      }
    })
);
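If you would rather not depend on jdbc-sshj, writing your own wrapper driver along the lines described above is also an option. Below is a minimal sketch of the idea using the JSch SSH library; the class name, the jdbc:tunnel-mysql URL scheme, and the sshUser/sshPassword property names are all invented for illustration, and a real implementation would need proper host-key verification and would have to tear down the SSH session when the connection is closed.

import com.jcraft.jsch.JSch;
import com.jcraft.jsch.Session;

import java.sql.Connection;
import java.sql.Driver;
import java.sql.DriverManager;
import java.sql.DriverPropertyInfo;
import java.sql.SQLException;
import java.sql.SQLFeatureNotSupportedException;
import java.util.Properties;
import java.util.logging.Logger;

// Sketch only. Assumed (made-up) URL form:
//   jdbc:tunnel-mysql://bastionHost/dbHost:dbPort/database
public class TunnelingMysqlDriver implements Driver {

  private static final String PREFIX = "jdbc:tunnel-mysql://";

  static {
    try {
      DriverManager.registerDriver(new TunnelingMysqlDriver());
    } catch (SQLException e) {
      throw new ExceptionInInitializerError(e);
    }
  }

  @Override
  public Connection connect(String url, Properties info) throws SQLException {
    if (!acceptsURL(url)) {
      return null; // JDBC contract: return null for URLs this driver does not handle
    }
    try {
      String[] parts = url.substring(PREFIX.length()).split("/", 3);
      String bastion = parts[0];
      String[] target = parts[1].split(":");
      String database = parts[2];

      // SSH to the bastion and forward a local ephemeral port to the database
      // host as the bastion sees it.
      Session session = new JSch().getSession(info.getProperty("sshUser"), bastion, 22);
      session.setPassword(info.getProperty("sshPassword"));
      session.setConfig("StrictHostKeyChecking", "no"); // not secure; verify host keys in real code
      session.connect();
      int localPort = session.setPortForwardingL(0, target[0], Integer.parseInt(target[1]));

      // Delegate the actual connection to the regular MySQL driver through the tunnel.
      return DriverManager.getConnection(
          "jdbc:mysql://127.0.0.1:" + localPort + "/" + database, info);
    } catch (Exception e) {
      throw new SQLException("Could not open SSH tunnel", e);
    }
  }

  @Override
  public boolean acceptsURL(String url) {
    return url != null && url.startsWith(PREFIX);
  }

  @Override
  public DriverPropertyInfo[] getPropertyInfo(String url, Properties info) {
    return new DriverPropertyInfo[0];
  }

  @Override
  public int getMajorVersion() { return 0; }

  @Override
  public int getMinorVersion() { return 1; }

  @Override
  public boolean jdbcCompliant() { return false; }

  @Override
  public Logger getParentLogger() throws SQLFeatureNotSupportedException {
    throw new SQLFeatureNotSupportedException();
  }
}

You would then pass this class's fully qualified name and your custom URL to JdbcIO.DataSourceConfiguration.create(...), exactly as with the sshj driver above.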
If you are using Dataflow, you can set up a GCE VM to act as your gateway. On that VM, use SSH port forwarding to tunnel the database to the VM's external interface (for example, ssh -L *:3306:database:3306 sshbastion run on the VM), make the port available within the VPC, and then run your Dataflow job on that VPC. If your database is already running in GCP, you can use this approach to run your Dataflow job on the same VPC as the database and drop the SSH step.
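With the gateway in place, the pipeline no longer needs a tunneling driver at all; it just points the stock MySQL driver at the gateway VM's internal address. A rough sketch, assuming a gateway VM named db-gateway in a project called my-project (both names hypothetical):

// Hypothetical host name: the GCE internal DNS name of the gateway VM.
JdbcIO.DataSourceConfiguration config = JdbcIO.DataSourceConfiguration.create(
        "com.mysql.cj.jdbc.Driver",
        "jdbc:mysql://db-gateway.c.my-project.internal:3306/mydb")
    .withUsername("username")
    .withPassword("password");

Everything else in the read (query, coder, row mapper) stays the same as in the example above.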