I already have a working connection through ODBC using the Cloudera ODBC Driver for Apache Hive: my DSN is set up, and all I need to do is call pyodbc.connect(f"DSN={mydsn}", autocommit=True).
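For context, this is roughly how the working ODBC path looks end to end (the DSN name below is a placeholder; the real one is configured in the ODBC manager):

import pandas as pd
import pyodbc

mydsn = "MyClouderaHiveDSN"  # placeholder for my actual DSN
conn = pyodbc.connect(f"DSN={mydsn}", autocommit=True)
df = pd.read_sql("SELECT * FROM eb.mobile_sa LIMIT 10", conn)
conn.close()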
Since I'm planning to use pandas on the query results, I've read that SQLAlchemy is the preferred choice, and I'd like to avoid the warnings pandas raises for other kinds of connections. My Hive DSN uses Zookeeper service discovery, and its "Hosts" field is filled in the form host1:2181,host2:2181,host3:2181.
I'm trying to connect to these three hosts, and I've changed the connection URL analogously to the one provided here, but I got invalid literal for int() with base 10: '2181,host2:2181,host3:2181', presumably because the dialect tries to parse everything after the first colon as a single integer port:
import pandas as pd
from sqlalchemy import create_engine

query = """SELECT * FROM eb.mobile_sa LIMIT 10"""

# UID holds my Hive user name
conn_url = f'hive://{UID}@host1:2181,host2:2181,host3:2181/;serviceDiscoveryMode=zooKeeper;zooKeeperNamespace=hiveserver2'
engine = create_engine(conn_url)
with engine.connect() as conn:
    df = pd.read_sql(query, conn)
I also found the kazoo module, which is said to be a Zookeeper client implementation in Python, but when I tried the very first lines from its Basic Usage section with just one host:
from kazoo.client import KazooClient

zk = KazooClient(hosts="host1:2181", read_only=True)
zk.start()
I got many repeated lines of Connection dropped: socket connection error.
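What I'm essentially trying to do is what the ODBC driver seems to do behind the scenes: ask Zookeeper which HiveServer2 instance is currently registered and connect to that single host. A rough sketch of that idea, assuming the znodes under /hiveserver2 encode a serverUri=host:port fragment (which I haven't been able to verify because of the connection errors above):

from kazoo.client import KazooClient
from sqlalchemy import create_engine

# Ask the Zookeeper ensemble which HiveServer2 instances are registered
zk = KazooClient(hosts="host1:2181,host2:2181,host3:2181", read_only=True)
zk.start()
znodes = zk.get_children("/hiveserver2")  # matches zooKeeperNamespace=hiveserver2
zk.stop()

# Assumption: each znode name contains a 'serverUri=host:port' fragment
server_uri = next(part for part in znodes[0].split(";") if part.startswith("serverUri="))
hs2_host, hs2_port = server_uri.split("=", 1)[1].split(":")

# Connect to the single resolved host with a plain single-host URL
# (UID is my Hive user name, 'default' is just an example database)
engine = create_engine(f"hive://{UID}@{hs2_host}:{hs2_port}/default")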
How can I correctly connect to Hive through multiple Zookeeper hosts with SQLAlchemy?