How to connect to Apache Hadoop with Impyla and Kerberos

Question

First of all, I have also read this question (since it seems to be similar).

My problem is that I am trying to connect to our Apache Hadoop system, which is now secured by Kerberos. I use the impyla module to do this. Before Kerberos was installed on the Hadoop system this worked well. I have since tried different solutions from the internet and nothing seems to work, but I have to admit that I have never worked with Kerberos before.

This is the code I use:

    from impala.dbapi import connect

    conn = connect(host=host,
                   port=port,
                   auth_mechanism='GSSAPI',
                   kerberos_service_name='impala')
    db_cursor = conn.cursor()
    db_cursor.execute('SHOW DATABASES')
    results = db_cursor.fetchall()
    db_names = [x[0] for x in results]  # first column holds the database name
    for name in db_names:
        print(name)

(host and port are passed as variables)

The error at the moment is: "no module named thrift_sasl"

Unfortunately, googling that error message does not lead me to anything useful. Some say that the "pyKerberos" module needs to be installed, but I'm unsure whether that solves the problem.

Is there something I forgot? I also have a Kerberos principal and password and manage them with "MIT Kerberos Ticket Manager". But maybe I also have to provide this information in the code somehow?

Hopefully someone can help me because I'm quite stuck here. :-)

The home page for Impyla project ought to list all its dependencies, direct and indirect, with minimal versions (and possibly _exact_ versions). — Samson Scharfrichter, Jan 25 '19 at 10:34
In other words: that's yet another example of the nightmare of Python modules dependencies. Try using Anaconda, their packaging system was built exactly for that -- clean up the mess. Well, most of the mess. — Samson Scharfrichter, Jan 25 '19 at 10:38
Thanks for your comments. I tried out Anaconda and I'm surprised how good their package manager is. But unfortunately it still doesn't work, and there is now another error message: "'tsocket' object has no attribute 'isopen'". Google says something about the package versions, but I cannot change all of the package versions in Anaconda. — Aquen, Jan 28 '19 at 14:52
You can manually override some dependencies that were pulled in automatically. The first step is to locate the appropriate package version for the appropriate Python version, either in the "official" Anaconda repo, or in the "Forge" repo, or in a thematic/private repo. The second step is to `conda install` it by specifying a specific channel (i.e. a custom repo) and a specific version. After that it's a trial-and-error process to make sure the problems are actually fixed. And at worst... you have to scan the JIRAs and manually fix the offending Python files, as always **:-/** — Samson Scharfrichter, Jan 28 '19 at 22:35

score 1 · Answer 1 · answered Aug 04 '19 at 14:39

I ran into the same issue, but I fixed it by installing the right versions of the required libraries.

Install the following Python libraries using pip:

six==1.12.0
bit_array==0.1.0
thrift==0.9.3
thrift_sasl==0.2.1
sasl==0.2.1
impyla==0.13.8
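
To confirm that an environment really matches these pins, something like the following may help. It is only a sketch: the `PINS` dictionary simply restates the list above, `check_pins` is a hypothetical helper name, and it relies on `importlib.metadata`, which needs Python 3.8 or newer.

```python
from importlib import metadata

# Pins restated from the list above; adjust them to your environment.
PINS = {
    "six": "1.12.0",
    "bit_array": "0.1.0",
    "thrift": "0.9.3",
    "thrift_sasl": "0.2.1",
    "sasl": "0.2.1",
    "impyla": "0.13.8",
}


def check_pins(pins):
    """Return {package: (wanted, found)} for packages that are missing or at another version."""
    problems = {}
    for name, wanted in pins.items():
        try:
            found = metadata.version(name)
        except metadata.PackageNotFoundError:
            found = None  # package is not installed at all
        if found != wanted:
            problems[name] = (wanted, found)
    return problems


if __name__ == "__main__":
    for name, (wanted, found) in check_pins(PINS).items():
        print("%s: wanted %s, found %s" % (name, wanted, found or "not installed"))
```

An empty result means every listed package is installed at the pinned version; anything printed is a candidate for `pip install name==version`.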

The code below works with Python 2.7 and 3.4.

import os
from impala.dbapi import connect

# Obtain a Kerberos ticket first; kinit prompts for the password.
os.system("kinit")
conn = connect(host='hostname.io', port=21050, use_ssl=True, database='default',
               user='urusername', kerberos_service_name='impala',
               auth_mechanism='GSSAPI')
cur = conn.cursor()
cur.execute('SHOW DATABASES')
for data in cur.fetchall():
    print(data)
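
Since `kinit` prompts for a password every time, it may be worth checking for an existing ticket first. The sketch below assumes the MIT Kerberos command-line tools are on the PATH (`klist -s` exits with status 0 only when the credential cache is readable and not expired). The function names, the principal, and the keytab path are all hypothetical placeholders, not part of impyla.

```python
import subprocess


def has_valid_ticket(klist="klist"):
    """Return True if `klist -s` reports a readable, unexpired credential cache."""
    try:
        return subprocess.run([klist, "-s"]).returncode == 0
    except FileNotFoundError:
        return False  # Kerberos client tools are not installed or not on PATH


def kinit_from_keytab(principal, keytab, kinit="kinit"):
    """Obtain a ticket without a password prompt; raises CalledProcessError if kinit fails."""
    subprocess.run([kinit, "-kt", keytab, principal], check=True)


# Example use (placeholder principal and keytab path, replace with your own):
#
#     if not has_valid_ticket():
#         kinit_from_keytab("urusername@EXAMPLE.COM", "/path/to/urusername.keytab")
```

With a valid ticket in the cache, the `connect(...)` call with `auth_mechanism='GSSAPI'` does not need a password in the code.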

score 0 · Accepted Answer · answered Feb 05 '19 at 10:11

After a long and error-prone journey I finally found a solution. Instead of using the "impyla" library I took another approach: I installed the Cloudera ODBC driver and configured a new connection in the ODBC Data Source Administrator tool. I also provided the .keytab file for the authentication there (as well as the user name, password and so on). Then I just used the Python library "pyodbc" as follows:

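import pyodbc
import pandas

# The DSN is the one configured in the ODBC Data Source Administrator.
conn = pyodbc.connect("DSN=NAMEOFYOURDSN", autocommit=True)
cursor = conn.cursor()
cursor.execute('SHOW DATABASES')
# Load the result set into a DataFrame, using the cursor's column names.
columns = [col[0] for col in cursor.description]
df = pandas.DataFrame.from_records([tuple(row) for row in cursor.fetchall()], columns=columns)
with pandas.option_context('display.max_rows', None, 'display.max_columns', None):
    print(df)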

This works well and I can start to process it further.

Ermolai · Answer 3 · 2021-11-29T08:18:12.583

I use the following setup:

OS: Ubuntu focal 20.04

$ python -V
Python 3.8.10

apt-get install libkrb5-dev krb5-user

impyla                            0.17.0     
thrift                            0.11.0     
thrift-sasl                       0.4.3  
pure-sasl                         0.6.2      
sasl                              0.3.1 
kerberos                          1.3.1

My (working) code:

Kerberos (You need a valid ticket)

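from impala.dbapi import connect

# timeout is a variable you define: the connection timeout in seconds
conn = connect(host='myhost', port=21050, timeout=timeout, auth_mechanism="GSSAPI", use_ssl=True, kerberos_service_name='impala')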

LDAP

conn = connect(host='myhost', port=21050, auth_mechanism='LDAP', password='ldap_pass', user='user', use_ssl=True)

or

conn = connect(host='myhost', port=21050, auth_mechanism='LDAP', password=ldap_pass, user='user', use_ssl=True, ca_cert="my/cert")

After connecting (with either method), run the following example:

cursor = conn.cursor()
cursor.execute('show databases')
print(cursor.fetchall())
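
If you switch between the two mechanisms often, the variants above can be folded into one small helper. This is only a sketch under a few assumptions: `impala_connect_kwargs` and `show_databases` are hypothetical names of my own, SSL is always on as in the calls above, and only keyword arguments already used in this answer are passed to `connect`.

```python
def impala_connect_kwargs(host, auth, port=21050, user=None, password=None,
                          ca_cert=None, kerberos_service_name="impala"):
    """Build keyword arguments for impala.dbapi.connect for 'GSSAPI' or 'LDAP'."""
    kwargs = {"host": host, "port": port, "use_ssl": True, "auth_mechanism": auth}
    if auth == "GSSAPI":
        # Kerberos: credentials come from the ticket cache, not from the code.
        kwargs["kerberos_service_name"] = kerberos_service_name
    elif auth == "LDAP":
        if user is None or password is None:
            raise ValueError("LDAP needs both user and password")
        kwargs.update(user=user, password=password)
    else:
        raise ValueError("unsupported auth mechanism: %r" % (auth,))
    if ca_cert is not None:
        kwargs["ca_cert"] = ca_cert
    return kwargs


def show_databases(**kwargs):
    """Connect, run 'show databases', and always close the connection."""
    # Imported here so the helper above can be used without impyla installed.
    from impala.dbapi import connect
    conn = connect(**kwargs)
    try:
        cursor = conn.cursor()
        cursor.execute("show databases")
        return cursor.fetchall()
    finally:
        conn.close()
```

For example, `show_databases(**impala_connect_kwargs('myhost', 'GSSAPI'))` would correspond to the Kerberos call, and passing `'LDAP'` together with `user`, `password` and optionally `ca_cert` to the LDAP ones.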
