
I want to convert my Hive SQL to Spark SQL to test the performance of the query. Here is my Hive SQL. Can anyone suggest how to convert the Hive SQL to Spark SQL?

SELECT split(DTD.TRAN_RMKS,'/')[0] AS TRAB_RMK1,
split(DTD.TRAN_RMKS,'/')[1] AS ATM_ID,
DTD.ACID,
G.FORACID,
DTD.REF_NUM,
DTD.TRAN_ID,
DTD.TRAN_DATE,
DTD.VALUE_DATE,
DTD.TRAN_PARTICULAR,
DTD.TRAN_RMKS,
DTD.TRAN_AMT,
SYSDATE_ORA(),
DTD.PSTD_DATE,
DTD.PSTD_FLG,
G.CUSTID,
NULL AS PROC_FLG,
DTD.PSTD_USER_ID,
DTD.ENTRY_USER_ID,
G.schemecode as SCODE
FROM DAILY_TRAN_DETAIL_TABLE2 DTD
JOIN ods_gam G
ON DTD.ACID = G.ACID
where substr(DTD.TRAN_PARTICULAR,1,3) rlike '(PUR|POS).*'
AND DTD.PART_TRAN_TYPE = 'D'
AND DTD.DEL_FLG <> 'Y'
AND DTD.PSTD_FLG = 'Y'
AND G.schemecode IN ('SBPRV','SBPRS','WSSTF','BGFRN','NREPV','NROPV','BSNRE','BSNRO')
AND  (SUBSTR(split(DTD.TRAN_RMKS,'/')[0],1,6) IN ('405997','406228','406229','415527','415528','417917','417918','418210','421539','421572','432198','435736','450502','450503','450504','468805','469190','469191','469192','474856','478286','478287','486292','490222','490223','490254','512932','512932','514833','522346','522352','524458','526106','526701','527114','527479','529608','529615','529616','532731','532734','533102','534680','536132','536610','536621','539149','539158','549751','557654','607118','607407','607445','607529','652189','652190','652157') OR   SUBSTR(split(DTD.TRAN_RMKS,'/')[0],1,8)  IN ('53270200','53270201','53270202','60757401','60757402') )
limit 50;
Sree Eedupuganti
  • Have you tried running it? Spark supports most of Hive functionality. You have to do something with the UDF though; as I understand, SYSDATE_ORA() is the one in your query. – MaxNevermind Oct 09 '16 at 09:45
  • I mean I want to rewrite the SQL using Spark functions like map, filter, etc. so that it increases the performance of the query, right? – Sree Eedupuganti Oct 09 '16 at 09:58
  • @SreeEedupuganti: Feel free to ask questions if you have any; check my answer with pseudocode. – Ram Ghadiyaram Oct 09 '16 at 17:22
  • I hadn't seen your question since you did not use @me to notify me. Question: "I mean I want to rewrite the SQL using Spark functions like map, filter, etc. so that it increases the performance of the query, right?" Answer: yes, it will increase performance with the DataFrame approach if you are doing computations on the data, such as mathematical calculations to derive column(s). In your case you have only string operations in the query, which is why you can also consider MaxNevermind's approach by doing a small POC. Hope that helps. If you care, please vote up / accept an answer as the owner. – Ram Ghadiyaram Dec 06 '16 at 08:29

2 Answers


The query is lengthy, so I won't attempt to write code for all of it here, but I would offer the DataFrames approach, which has the flexibility to implement the above query using DataFrame and Column operations such as filter, withColumn (if you want to convert/apply a Hive UDF as a Scala function/UDF), and cast for casting data types. I've done this recently and it is performant. Below is pseudocode in Scala.

   val df1 = hiveContext.sql("select * from ods_gam").as("G")
   val df2 = hiveContext.sql("select * from DAILY_TRAN_DETAIL_TABLE2").as("DTD")
    

Now, join your DataFrames:

val joinedDF = df1.join(df2, df1("G.ACID") === df2("DTD.ACID"), "inner")
// now apply your string functions here...
// joinedDF.withColumn(...), filter(...), when/otherwise, etc.

Note: I think in your case UDFs are not required; simple string functions would suffice (see the sketch below).
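
For illustration, here is a rough, untested sketch (assuming the joinedDF with the "G"/"DTD" aliases from above and the column names from the original query) of how the split()-derived columns and the string predicates could be expressed with built-in Column functions; the prefix lists are abbreviated:

import org.apache.spark.sql.functions.{col, split, substring}

// prefix lists taken from the original query (abbreviated here)
val prefixes6 = Seq("405997", "406228", "406229" /* ..., "652157" */)
val prefixes8 = Seq("53270200", "53270201", "53270202", "60757401", "60757402")

val result = joinedDF
  .withColumn("TRAB_RMK1", split(col("DTD.TRAN_RMKS"), "/").getItem(0))
  .withColumn("ATM_ID", split(col("DTD.TRAN_RMKS"), "/").getItem(1))
  .filter(substring(col("DTD.TRAN_PARTICULAR"), 1, 3).rlike("(PUR|POS).*"))
  .filter(col("DTD.PART_TRAN_TYPE") === "D")
  .filter(col("DTD.DEL_FLG") !== "Y")   // !== in Spark 1.x; =!= from Spark 2.0
  .filter(col("DTD.PSTD_FLG") === "Y")
  .filter(col("G.schemecode").isin("SBPRV", "SBPRS", "WSSTF", "BGFRN", "NREPV", "NROPV", "BSNRE", "BSNRO"))
  .filter(substring(col("TRAB_RMK1"), 1, 6).isin(prefixes6: _*) ||
          substring(col("TRAB_RMK1"), 1, 8).isin(prefixes8: _*))
  .limit(50)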

Also have a look at DataFrameJoinSuite.scala in the Spark source, which could be very useful for you.

For further details, refer to the docs.

Spark 1.5 :

Construct Dummy Data

import util.Random
import org.apache.spark.sql.Row
implicit class Crossable[X](xs: Traversable[X]) {
  def cross[Y](ys: Traversable[Y]) = for { x <- xs; y <- ys } yield (x, y)
}
val students = Seq("John", "Mike","Matt")
val subjects = Seq("Math", "Sci", "Geography", "History")
val random = new Random(1)
val data =(students cross subjects).map{x  =>  Row(x._1, x._2,random.nextInt(100))}.toSeq
 
// Create Schema Object
import org.apache.spark.sql.types.{StructType, StructField, IntegerType, StringType}
val schema = StructType(Array(
            StructField("student", StringType, nullable=false),
            StructField("subject", StringType, nullable=false),
            StructField("score", IntegerType, nullable=false)
    ))
 
// Create DataFrame (sqlContext as provided in spark-shell; a HiveContext also works)
val rdd = sc.parallelize(data)
val df = sqlContext.createDataFrame(rdd, schema)
 
// Define a UDF that maps a numeric score to a letter grade
import org.apache.spark.sql.functions.udf
val udfScoreToCategory = udf((score: Int) =>
  score match {
    case t if t >= 80 => "A"
    case t if t >= 60 => "B"
    case t if t >= 35 => "C"
    case _            => "D"
  })
df.withColumn("category", udfScoreToCategory(df("score"))).show(10)
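
For comparison, tying back to the note above that plain string/column functions often suffice, here is the same categorization expressed with the built-in when/otherwise functions instead of a UDF; built-in expressions stay visible to the Catalyst optimizer, whereas a UDF is a black box to it:

import org.apache.spark.sql.functions.when

df.withColumn("category",
  when(df("score") >= 80, "A")
    .when(df("score") >= 60, "B")
    .when(df("score") >= 35, "C")
    .otherwise("D")
).show(10)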
Ram Ghadiyaram
  • Note: please consider @MaxNevermind's thought as well; you have a simple SQL with string functions, so do a small POC to know the performance benefits before proceeding with the Spark SQL or HiveQL approach. – Ram Ghadiyaram Oct 12 '16 at 17:31

Just try to use it as it is; you should benefit right away if you were previously running this query with Hive on MapReduce. From there, if you still need better results, you can analyze the query plan and optimize it further, for example by using partitioning. Spark uses memory more heavily and, beyond simple transformations, is generally faster than MapReduce; Spark SQL also uses the Catalyst optimizer, so your query benefits from that too.
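
As a minimal sketch (assuming a Spark 1.x build with Hive support), running the query unchanged through a HiveContext might look like the following; the query body is elided here, and SYSDATE_ORA() would need either its Hive UDF jar on the classpath or a swap to a built-in such as current_timestamp():

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

val sc = new SparkContext(new SparkConf().setAppName("hive-query-as-is"))
val hiveContext = new HiveContext(sc)

// paste the original HiveQL here, unchanged apart from SYSDATE_ORA()
val result = hiveContext.sql(
  """SELECT ...   -- columns from the original query
    |FROM DAILY_TRAN_DETAIL_TABLE2 DTD
    |JOIN ods_gam G ON DTD.ACID = G.ACID
    |WHERE ...    -- predicates from the original query
    |LIMIT 50""".stripMargin)

result.show(50)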

Considering your comment about "using spark functions like Map, Filter etc": map() just transforms data, and since you only have string functions, I don't think you will gain anything by rewriting them with .map(...); Spark will do those transformations for you. As for filter(), if you can filter the input data, you can just rewrite the query using subqueries and other SQL capabilities.

MaxNevermind