Hi, I am trying to create a data source for Apache Beam in Python. I know that with Java you can connect to Cloud SQL using JDBC. Similarly, I am trying to create a source for Dataflow (Apache Beam) on Google Cloud Platform. I have inherited from the BoundedSource class and used the jaydebeapi library (a Python wrapper for JDBC) to connect to the MySQL database. Please see the code below.
import jaydebeapi
from apache_beam.io import iobase


# Create new source for Cloud SQL
class odbcsource(iobase.BoundedSource):

    def __init__(self, server=None, driver=None, database=None,
                 username=None, password=None, sql=None, port=None,
                 driver_path=None):
        self.server = server
        self.driver = driver
        self.database = database
        self.username = username
        self.password = password
        self.sql = sql
        self.port = port
        self.driver_path = driver_path

    def read(self, range_tracker):
        # BoundedSource.read() receives a range tracker and yields records
        cursor = self._query_mssql()
        for row in cursor.fetchall():
            yield row

    def _query_mssql(self):
        """
        Queries the Cloud SQL (MySQL) instance over JDBC and returns
        a cursor to the results.
        """
        conn = jaydebeapi.connect(
            self.driver,
            "jdbc:mysql://" + self.server + ":" + str(self.port) + "/" + self.database,
            {'user': self.username, 'password': self.password},
            self.driver_path)
        cursor = conn.cursor()
        cursor.execute(self.sql)
        return cursor
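For reference, this is roughly how I plug the source into a pipeline. The connection values below are placeholders rather than my real settings, and I know a complete BoundedSource also needs estimate_size(), split() and get_range_tracker(), which I have left out here:

import apache_beam as beam

with beam.Pipeline() as p:
    rows = p | 'ReadFromCloudSQL' >> beam.io.Read(odbcsource(
        server='127.0.0.1',                 # placeholder host
        driver='com.mysql.jdbc.Driver',
        database='mydb',                    # placeholder database
        username='user',
        password='secret',
        sql='SELECT * FROM mytable',        # placeholder query
        port=3306,
        driver_path='/tmp/mysql-connector-java.jar'))  # path to the JDBC jar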
The .jar driver file itself I have stored in Google Cloud Storage in a temporary location. However, Python needs the Java Development Kit (JDK) to run the Java driver code, and while running locally on my computer I can set the JAVA_HOME variable and point it at the JDK installation on my machine.
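Locally that amounts to something like the following before opening the connection (the JDK path here is specific to my machine and just illustrative):

import os

# Point jaydebeapi (via JPype) at the local JDK; the path is machine-specific
os.environ['JAVA_HOME'] = '/usr/lib/jvm/java-8-openjdk-amd64'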
However, when I run this in Dataflow I get the error "NO JVM shared library file found. Try setting the JAVA_HOME environment variable properly." This is because in Dataflow I cannot install the Java Development Kit (JDK) or set an environment variable.
Is there a way to install the JDK on Dataflow workers and reference environment variables? Also, any thoughts on how to run a Python Apache Beam job that extracts data from a Cloud SQL database this way?
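One workaround I have been considering, adapted from the custom-commands setup.py pattern in the Beam examples, is to have each Dataflow worker apt-get install a JRE at staging time by passing a setup.py via the --setup_file pipeline option. This is only a sketch, and the package name (default-jre) is my guess:

# setup.py -- passed to the pipeline with --setup_file ./setup.py
import subprocess
from distutils.command.build import build as _build
import setuptools

# Shell commands run once on each worker while the package is staged.
CUSTOM_COMMANDS = [
    ['apt-get', 'update'],
    ['apt-get', '--assume-yes', 'install', 'default-jre'],
]

class build(_build):
    # Hook the custom commands into the normal build step
    sub_commands = _build.sub_commands + [('CustomCommands', None)]

class CustomCommands(setuptools.Command):
    def initialize_options(self):
        pass

    def finalize_options(self):
        pass

    def run(self):
        for command in CUSTOM_COMMANDS:
            # Fail loudly if any install step does not succeed
            subprocess.check_call(command)

setuptools.setup(
    name='cloudsql-jdbc-job',   # hypothetical package name
    version='0.0.1',
    packages=setuptools.find_packages(),
    cmdclass={'build': build, 'CustomCommands': CustomCommands},
)

I have not verified that this leaves JAVA_HOME set correctly on the workers, so I would still appreciate any pointers.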