
I read a post on Quora which says that Spark Thrift Server is related to Apache Thrift, which is a binary communication protocol. Spark Thrift Server is the interface to Hive, but how does Spark Thrift Server use Apache Thrift to communicate with Hive via the binary protocol/RPC?

pacman
  • _"Spark Thrift server is the interface to Hive"_ > it is a **partial replacement** of the Hive stack. Thrift protocol is used between HiveServer2 (or Impala, or Spark) service, and the JDBC/ODBC/DBI clients. Thrift protocol is also used between the "SQL" service (or the legacy Hive CLI) and the Hive Metastore. – Samson Scharfrichter Aug 14 '17 at 06:41
  • @Samson Scharfrichter can I use `Spark Thrift` as a full replacement for Hive, but with the Spark API? – pacman Aug 14 '17 at 07:41
  • At your own risk. The other option is to use Hive-on-Spark (i.e. Spark as the "execution engine" instead of TEZ or MapReduce), at your own risk. – Samson Scharfrichter Aug 14 '17 at 07:45
  • @Samson Scharfrichter thx, I get it, one more question: does the command `start-thriftserver.sh` start a Hive server? – pacman Aug 14 '17 at 07:53
  • Read post by @RussS: http://www.russellspitzer.com/2017/05/19/Spark-Sql-Thriftserver/ – T. Gawęda Aug 14 '17 at 10:13
  • @T. Gawęda "Thrift Server is still built on the HiveServer2 code, almost all of the internals are now completely Spark-native" - does it mean that Thrift Server is an implementation of the Hive concept but with a Spark flavour? – pacman Aug 14 '17 at 10:25
  • It has the interface of Hive, but most of the work is computed in Spark – T. Gawęda Aug 14 '17 at 10:27
  • @T. Gawęda so, as a consequence, the command `start-thriftserver.sh` starts a Hive server and provides an API to beeline, am I right? – pacman Aug 14 '17 at 10:45
  • Yes, it creates a Hive-compatible interface, but the work is computed using Spark – T. Gawęda Aug 14 '17 at 11:05
  • Does this answer your question? [what is HiveServer and Thrift server](https://stackoverflow.com/questions/40924632/what-is-hiveserver-and-thrift-server) – Vkreddy Mar 26 '21 at 05:19

2 Answers


Spark Thrift Server is a Hive-compatible interface for Spark.

That means it provides an implementation of HiveServer2 that you can connect to with beeline; however, almost all of the computation is done by Spark, not Hive.

In previous versions, the query parser came from Hive. Currently, Spark Thrift Server uses the Spark query parser.

Apache Thrift is a framework for developing RPC (Remote Procedure Call) services, so many systems are built on it. Cassandra, for example, used Thrift before replacing it with the Cassandra native protocol.

So Apache Thrift is a framework for developing RPCs, and Spark Thrift Server is an implementation of the Hive protocol that uses Spark as its computation framework.

For more details, please see this post from @RussS: http://www.russellspitzer.com/2017/05/19/Spark-Sql-Thriftserver/
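
To make the chain concrete: a client such as beeline talks to Spark Thrift Server through a HiveServer2-style `jdbc:hive2://` URL, and Spark executes the query. The sketch below only assembles and prints the two commands involved rather than running them. The host, port 10000 (set here through `hive.server2.thrift.port`), and the `/opt/spark` fallback for `SPARK_HOME` are assumptions for illustration, and the helper names `jdbc_url`, `start_cmd`, and `beeline_cmd` are hypothetical.

```shell
#!/bin/sh
# Sketch only: prints the commands instead of executing them.
# The SPARK_HOME fallback, host, and port are assumptions; adjust for your cluster.
SPARK_HOME="${SPARK_HOME:-/opt/spark}"

# Build a HiveServer2-style JDBC URL: jdbc:hive2://<host>:<port>/<database>
jdbc_url() {
  printf 'jdbc:hive2://%s:%s/%s' "$1" "$2" "$3"
}

# Command that starts Spark Thrift Server listening on the given port.
start_cmd() {
  printf '%s/sbin/start-thriftserver.sh --hiveconf hive.server2.thrift.port=%s' "$SPARK_HOME" "$1"
}

# Command that sends one SQL statement through beeline to the given JDBC URL.
beeline_cmd() {
  printf "%s/bin/beeline -u '%s' -e \"%s\"" "$SPARK_HOME" "$1" "$2"
}

start_cmd 10000; echo
beeline_cmd "$(jdbc_url localhost 10000 default)" "show databases;"; echo
```

The client side never names Spark: beeline only sees a `jdbc:hive2://` endpoint, which is what "Hive-compatible interface" means in practice.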

T. Gawęda
  • how is Spark Thrift related to Apache Thrift (the RPC framework)? – pacman Aug 14 '17 at 12:05
  • As mentioned - Spark uses Apache Thrift – T. Gawęda Aug 14 '17 at 12:06
  • for what purpose does Spark Thrift need Apache Thrift? – pacman Aug 14 '17 at 12:07
  • To be honest - I don't know. The Hive interface was compatible with Thrift, so Spark must also be compatible with Thrift to be compatible with Hive - complicated chain :) – T. Gawęda Aug 14 '17 at 12:10
  • Ping @pacman, as I should have done in the previous comment :) Hope it helps – T. Gawęda Aug 14 '17 at 12:20
  • thx, btw I found this info: HiveServer is based on the Apache Thrift project, so maybe the internals of HiveServer use RPC for some purposes – pacman Aug 14 '17 at 12:25
  • @pacman, yes, Hive uses Thrift in its protocol and that's why Spark must also be Thrift-compatible to be Hive-compatible :) – T. Gawęda Aug 14 '17 at 12:27
  • A bit of history -- when Google open-sourced their ProtocolBuffer binary message format, they did not open their dedicated PB server. When Facebook open-sourced their Thrift binary message format, they also provided an out-of-the-box Thrift server. HDFS and HBase opted for PB with custom server implementations and lots of performance tweaks. Hive opted for Thrift, as-is. – Samson Scharfrichter Aug 14 '17 at 12:32
  • @Samson Scharfrichter Is `Apache Thrift` necessary for `Spark Thrift` in order to provide fast RPC operations, or for some other purpose? – pacman Aug 14 '17 at 12:42
  • Thrift lets you connect to Spark over JDBC – Neil McGuigan Mar 02 '21 at 06:54

You can bring up Spark Thrift Server on AWS EMR with the following command: `sudo /usr/lib/spark/sbin/start-thriftserver.sh --master yarn-client`

On EMR, the default port for Spark Thrift Server is 10001.

To connect to it with beeline on EMR, use the following command:

/usr/lib/spark/bin/beeline -u 'jdbc:hive2://:10001/default' -e "show databases;"

By default, the Hive Thrift Server is always up and running on EMR, but Spark Thrift Server is not.

You can also connect any application to Spark Thrift Server over ODBC/JDBC, and monitor its queries on the EMR cluster by clicking the Application Master link for the "org.apache.spark.sql.hive.thriftserver.HiveThriftServer2" job in the YARN Resource Manager UI (port 8088).
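
Since both Thrift servers can run side by side on EMR, the port is what decides which one beeline reaches. The sketch below prints the beeline command for either server without running it. Port 10001 is the Spark value given above; 10000 is HiveServer2's usual default and is an assumption for EMR. The host `master-node.example` is a hypothetical placeholder, and `thrift_port` and `emr_beeline_cmd` are hypothetical helper names.

```shell
#!/bin/sh
# Sketch only: prints a beeline command for either Thrift endpoint on an EMR master node.
# 10001 is the Spark Thrift Server port from the answer; 10000 (HiveServer2's usual
# default) is an assumption for EMR. The host name is a hypothetical placeholder.

# Map a server kind ("hive" or "spark") to its Thrift port.
thrift_port() {
  case "$1" in
    hive)  echo 10000 ;;
    spark) echo 10001 ;;
    *)     echo "unknown server kind: $1" >&2; return 1 ;;
  esac
}

# Print the beeline command that would query the chosen server on the given host.
emr_beeline_cmd() {
  port=$(thrift_port "$2") || return 1
  printf "/usr/lib/spark/bin/beeline -u 'jdbc:hive2://%s:%s/default' -e \"show databases;\"\n" "$1" "$port"
}

emr_beeline_cmd master-node.example spark
```

Passing `hive` instead of `spark` changes only the port, which reflects that both servers expose the same HiveServer2-style interface to clients.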

Saurav Bhowmick