
I'm using the following code to connect to a Databricks cluster, but I get thrift.transport.TTransport.TTransportException: TSocket read 0 bytes. I think there is an error in the my_url I pass to create_engine.

from sqlalchemy.engine import create_engine

engine = create_engine(my_url)

res = engine.execute(
    'SELECT * FROM alex_db.sams_us_clubs_sams_member_index LIMIT 5')

for row in res:
    print(row)

Because this is my company's cluster, I cannot show the details of my_url. Does anyone have experience using pyhive and SQLAlchemy to connect to a Databricks cluster?


2 Answers


How do you want to access Hive? Are you running it on a Spark/Databricks server?

If you want to run it programmatically, see the Databricks documentation on setting up JDBC/ODBC access (https://docs.databricks.com/user-guide/bi/jdbc-odbc-bi.html).
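For reference, the JDBC connection string described there typically has the shape below (a sketch based on the Simba Spark driver format; the hostname, HTTP path, and token are placeholders you take from your cluster's JDBC/ODBC settings):

jdbc:spark://<server-hostname>:443/default;transportMode=http;ssl=1;httpPath=<http-path>;AuthMech=3;UID=token;PWD=<personal-access-token>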

If you are running the application on a Spark server, you can enable Hive querying like this:

# Create a SparkSession with Hive support. No need to create a
# SparkContext -- you automatically get one as part of the SparkSession.
# warehouseLocation should point at your Hive warehouse directory.
from pyspark.sql import SparkSession

spark = (SparkSession
         .builder
         .appName("SparkSessionZipsExample")
         .config("spark.sql.warehouse.dir", warehouseLocation)
         .enableHiveSupport()
         .getOrCreate())

source: https://databricks.com/blog/2016/08/15/how-to-use-sparksession-in-apache-spark-2-0.html
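Once Hive support is enabled, the table from the question can be queried directly through the session (a minimal sketch reusing the table name from the question):

# Run the Hive query through the SparkSession and print a few rows.
df = spark.sql(
    "SELECT * FROM alex_db.sams_us_clubs_sams_member_index LIMIT 5")
df.show()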


You can create a SQLAlchemy engine for pyhive/Databricks with the snippet below. You will need to add all required dependencies to your project.

token: an access token, which can be generated in the Databricks console
region: the server location, i.e. the ${location_of_server} part of https://${location_of_server}.azuredatabricks.net/
database: name of the database you want to connect to
http_path: sql/protocolv1/o/${organization_id}/${cluster_id}

from sqlalchemy.engine import create_engine

dbfs_engine = create_engine(
    "databricks+pyhive://token:" + token + "@" + region
    + ".azuredatabricks.net:443/" + database,
    connect_args={"http_path": http_path},
)
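Running the query from the question against this engine then looks roughly like this (a sketch; engine.execute is the pre-2.0 SQLAlchemy API the question already uses):

# Execute the query and print each returned row.
res = dbfs_engine.execute(
    "SELECT * FROM alex_db.sams_us_clubs_sams_member_index LIMIT 5")
for row in res:
    print(row)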