Hi, I am trying to create a data source for Apache Beam in Python. I know that with Java you can connect to Cloud SQL using JDBC. Similarly, I am trying to create a source for Dataflow (Apache Beam) on Google Cloud Platform. I have inherited from the BoundedSource class and used the jaydebeapi library (a Python wrapper for JDBC) to connect to the MySQL database. Please see the code below.
import jaydebeapi
from apache_beam.io import iobase


# Create new source for Cloud SQL
class odbcsource(iobase.BoundedSource):

    def __init__(self, server=None, driver=None, database=None,
                 username=None, password=None, sql=None, port=None,
                 driver_path=None):
        self.server = server
        self.driver = driver
        self.database = database
        self.username = username
        self.password = password
        self.sql = sql
        self.port = port
        self.driver_path = driver_path

    def read(self, range_tracker):
        # BoundedSource.read() receives a range tracker and yields records
        cursor = self._query_mssql()
        for row in cursor.fetchall():
            yield row

    def _query_mssql(self):
        """
        Queries the Cloud SQL (MySQL) instance over JDBC and returns
        a cursor to the results.
        """
        conn = jaydebeapi.connect(
            self.driver,
            "jdbc:mysql://" + self.server + ":" + str(self.port) + "/" + self.database,
            {'user': self.username, 'password': self.password},
            self.driver_path)
        cursor = conn.cursor()
        cursor.execute(self.sql)
        return cursor
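For reference, this is roughly how I plug the source into a pipeline. The connection values below are placeholders rather than my real settings, and I know a complete BoundedSource also needs estimate_size(), split() and get_range_tracker(), which I have left out here:

import apache_beam as beam

with beam.Pipeline() as p:
    rows = p | 'ReadFromCloudSQL' >> beam.io.Read(odbcsource(
        server='127.0.0.1',                 # placeholder host
        driver='com.mysql.jdbc.Driver',
        database='mydb',                    # placeholder database
        username='user',
        password='secret',
        sql='SELECT * FROM mytable',        # placeholder query
        port=3306,
        driver_path='/tmp/mysql-connector-java.jar'))  # path to the JDBC jar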
The .jar driver file itself I have stored in Google Cloud Storage in a temporary location. However, Python needs the Java Development Kit (JDK) to run the Java driver code, and while running locally on my computer I can set the JAVA_HOME variable and point it at the JDK installation on my machine.
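Locally that amounts to something like the following before opening the connection (the JDK path here is specific to my machine and just illustrative):

import os

# Point jaydebeapi (via JPype) at the local JDK; the path is machine-specific
os.environ['JAVA_HOME'] = '/usr/lib/jvm/java-8-openjdk-amd64'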
However, when I run this in Dataflow I get the error "NO JVM shared library file found. Try setting the JAVA_HOME environment variable properly." This is because in Dataflow I cannot install the Java Development Kit (JDK) or set an environment variable.
Is there a way to install the JDK on Dataflow workers and reference environment variables? Also, any thoughts on how to run a Python Apache Beam job that extracts data from a Cloud SQL database this way?
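One workaround I have been considering, adapted from the custom-commands setup.py pattern in the Beam examples, is to have each Dataflow worker apt-get install a JRE at staging time by passing a setup.py via the --setup_file pipeline option. This is only a sketch, and the package name (default-jre) is my guess:

# setup.py -- passed to the pipeline with --setup_file ./setup.py
import subprocess
from distutils.command.build import build as _build
import setuptools

# Shell commands run once on each worker while the package is staged.
CUSTOM_COMMANDS = [
    ['apt-get', 'update'],
    ['apt-get', '--assume-yes', 'install', 'default-jre'],
]

class build(_build):
    # Hook the custom commands into the normal build step
    sub_commands = _build.sub_commands + [('CustomCommands', None)]

class CustomCommands(setuptools.Command):
    def initialize_options(self):
        pass

    def finalize_options(self):
        pass

    def run(self):
        for command in CUSTOM_COMMANDS:
            # Fail loudly if any install step does not succeed
            subprocess.check_call(command)

setuptools.setup(
    name='cloudsql-jdbc-job',   # hypothetical package name
    version='0.0.1',
    packages=setuptools.find_packages(),
    cmdclass={'build': build, 'CustomCommands': CustomCommands},
)

I have not verified that this leaves JAVA_HOME set correctly on the workers, so I would still appreciate any pointers.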