2

I need to fetch data from a Mysql server through ssh tunneling. I am using Apache Beam 2.19.0 Java JdbcIO on Google Dataflow to connect to the database. But as the database is inside a private network I need to reach the database through one in between ssh server via ssh tunneling.

Is it something achievable using apache beam jdbc IO ?

1 Answers1

1

This functionality isn't built into Apache Beam, however there are several options. The JdbcIO uses the standard Java JDBC interface to connect to your database. It wouldn't be too difficult to overload the Mysql JDBC Driver with your own wrapper that sets up a SSH tunnel before connecting. I did a quick Google search and found a project that wraps an arbitrary JDBC driver with an SSH tunnel using SSHJ: jdbc-sshj (a copy is published to maven as com.cekrlic:jdbc-sshj:0.1.0). The project looks somewhat unmantained but it will do what you want. Add this to your runtime dependencies then update your config to something like this (this example is not secure):

pipeline.apply(JdbcIO.<KV<Integer, String>>read()
  .withDataSourceConfiguration(JdbcIO.DataSourceConfiguration.create(
    "com.cekrlic.jdbc.ssh.tunnel.SshJDriver",
    "jdbc:sshj://sshbastion?remote=database:3306&username=sshuser&password=sshpassword&verify_hosts=off;;;jdbc:mysql://localhost:3306/mydb")
    .username("username")
    .withPassword("password"))
  .withQuery("select id,name from Person")
  .withCoder(KvCoder.of(BigEndianIntegerCoder.of(), StringUtf8Coder.of()))
  .withRowMapper(new JdbcIO.RowMapper<KV<Integer, String>>() {
    public KV<Integer, String> mapRow(ResultSet resultSet) throws Exception {
      return KV.of(resultSet.getInt(1), resultSet.getString(2));
    }
  })
);

If you are using Dataflow you can setup a GCE VM to act as your gateway. On that VM use SSH forwarding to tunnel the Database to the VM's external interface (ssh -R \*:3306:database:3306 sshbastion), make the port avalable within the VPC, and then run your Dataflow job on your VPC. If your database is already running in GCP, you can use this approach to run your dataflow job on the same VPC as the database and drop the SSH step.

Andrew Pilloud
  • 418
  • 2
  • 6
  • Hello from the future, could you please provide more info about the gateway GCE setup ? What is sshbastion in your command ? Your approach is very interesting to me to solve this issue: https://stackoverflow.com/questions/64126581/how-to-set-up-a-ssh-tunnel-in-google-cloud-dataflow-to-an-external-database-serv – matthieu.cham Sep 30 '20 at 08:24