
I want to convert my Hive SQL to Spark SQL to test the performance of the query. Here is my Hive SQL. Can anyone suggest how to convert the Hive SQL to Spark SQL?

SELECT split(DTD.TRAN_RMKS,'/')[0] AS TRAB_RMK1,
split(DTD.TRAN_RMKS,'/')[1] AS ATM_ID,
DTD.ACID,
G.FORACID,
DTD.REF_NUM,
DTD.TRAN_ID,
DTD.TRAN_DATE,
DTD.VALUE_DATE,
DTD.TRAN_PARTICULAR,
DTD.TRAN_RMKS,
DTD.TRAN_AMT,
SYSDATE_ORA(),
DTD.PSTD_DATE,
DTD.PSTD_FLG,
G.CUSTID,
NULL AS PROC_FLG,
DTD.PSTD_USER_ID,
DTD.ENTRY_USER_ID,
G.schemecode as SCODE
FROM DAILY_TRAN_DETAIL_TABLE2 DTD
JOIN ods_gam G
ON DTD.ACID = G.ACID
where substr(DTD.TRAN_PARTICULAR,1,3) rlike '(PUR|POS).*'
AND DTD.PART_TRAN_TYPE = 'D'
AND DTD.DEL_FLG <> 'Y'
AND DTD.PSTD_FLG = 'Y'
AND G.schemecode IN ('SBPRV','SBPRS','WSSTF','BGFRN','NREPV','NROPV','BSNRE','BSNRO')
AND  (SUBSTR(split(DTD.TRAN_RMKS,'/')[0],1,6) IN ('405997','406228','406229','415527','415528','417917','417918','418210','421539','421572','432198','435736','450502','450503','450504','468805','469190','469191','469192','474856','478286','478287','486292','490222','490223','490254','512932','512932','514833','522346','522352','524458','526106','526701','527114','527479','529608','529615','529616','532731','532734','533102','534680','536132','536610','536621','539149','539158','549751','557654','607118','607407','607445','607529','652189','652190','652157') OR   SUBSTR(split(DTD.TRAN_RMKS,'/')[0],1,8)  IN ('53270200','53270201','53270202','60757401','60757402') )
limit 50;
Sree Eedupuganti
  • Have you tried running it? Spark supports most of Hive functionality. You have to do something with the UDF though; as I understand, SYSDATE_ORA() is the one in your query. – MaxNevermind Oct 09 '16 at 09:45
  • I mean I want to rewrite the SQL using Spark functions like map, filter, etc. so that it increases the performance of the query, right? – Sree Eedupuganti Oct 09 '16 at 09:58
  • @SreeEedupuganti: Feel free to ask questions if you have any; check my answer with pseudocode. – Ram Ghadiyaram Oct 09 '16 at 17:22
  • I hadn't seen your question since you did not use @me to notify me. Question: "I mean I want to rewrite the SQL using Spark functions like map, filter, etc. so that it increases the performance of the query, right?" Answer: yes, it will increase performance with the DataFrame approach if you are doing computations on the data, such as mathematical calculations to derive column(s). In your case you have only string operations in the query, which is why you can also consider MaxNevermind's approach by doing a small POC. Hope that helps. If you care, please vote up / accept an answer as the owner. – Ram Ghadiyaram Dec 06 '16 at 08:29

2 Answers


The query is lengthy, so I won't attempt to write code for all of it here, but I would offer the DataFrames approach, which has the flexibility to implement the above query using DataFrame and Column operations such as filter, withColumn (if you want to convert/apply a Hive UDF as a Scala function/UDF), and cast for casting data types. I've done this recently and it is performant. Below is pseudocode in Scala.

   val df1 = hiveContext.sql("select * from ods_gam").as("G")
   val df2 = hiveContext.sql("select * from DAILY_TRAN_DETAIL_TABLE2").as("DTD")
    

Now, join your DataFrames:

val joinedDF = df1.join(df2, df1("G.ACID") === df2("DTD.ACID"), "inner")
// now apply your string functions here...
// joinedDF.withColumn(...), filter(...), when/otherwise, etc.

Note: I think in your case UDFs are not required; simple string functions would suffice (see the sketch below).
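
For illustration, here is a rough, untested sketch (assuming the joinedDF with the "G"/"DTD" aliases from above and the column names from the original query) of how the split()-derived columns and the string predicates could be expressed with built-in Column functions; the prefix lists are abbreviated:

import org.apache.spark.sql.functions.{col, split, substring}

// prefix lists taken from the original query (abbreviated here)
val prefixes6 = Seq("405997", "406228", "406229" /* ..., "652157" */)
val prefixes8 = Seq("53270200", "53270201", "53270202", "60757401", "60757402")

val result = joinedDF
  .withColumn("TRAB_RMK1", split(col("DTD.TRAN_RMKS"), "/").getItem(0))
  .withColumn("ATM_ID", split(col("DTD.TRAN_RMKS"), "/").getItem(1))
  .filter(substring(col("DTD.TRAN_PARTICULAR"), 1, 3).rlike("(PUR|POS).*"))
  .filter(col("DTD.PART_TRAN_TYPE") === "D")
  .filter(col("DTD.DEL_FLG") !== "Y")   // !== in Spark 1.x; =!= from Spark 2.0
  .filter(col("DTD.PSTD_FLG") === "Y")
  .filter(col("G.schemecode").isin("SBPRV", "SBPRS", "WSSTF", "BGFRN", "NREPV", "NROPV", "BSNRE", "BSNRO"))
  .filter(substring(col("TRAB_RMK1"), 1, 6).isin(prefixes6: _*) ||
          substring(col("TRAB_RMK1"), 1, 8).isin(prefixes8: _*))
  .limit(50)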

Also have a look at DataFrameJoinSuite.scala in the Spark source, which could be very useful for you.

For further details, refer to the docs.

Spark 1.5 :

Construct Dummy Data

import util.Random
import org.apache.spark.sql.Row
implicit class Crossable[X](xs: Traversable[X]) {
  def cross[Y](ys: Traversable[Y]) = for { x <- xs; y <- ys } yield (x, y)
}
val students = Seq("John", "Mike","Matt")
val subjects = Seq("Math", "Sci", "Geography", "History")
val random = new Random(1)
val data =(students cross subjects).map{x  =>  Row(x._1, x._2,random.nextInt(100))}.toSeq
 
// Create Schema Object
import org.apache.spark.sql.types.{StructType, StructField, IntegerType, StringType}
val schema = StructType(Array(
            StructField("student", StringType, nullable=false),
            StructField("subject", StringType, nullable=false),
            StructField("score", IntegerType, nullable=false)
    ))
 
// Create DataFrame (sqlContext as provided in spark-shell; a HiveContext also works)
val rdd = sc.parallelize(data)
val df = sqlContext.createDataFrame(rdd, schema)
 
// Define a UDF that maps a numeric score to a letter grade
import org.apache.spark.sql.functions.udf
val udfScoreToCategory = udf((score: Int) =>
  score match {
    case t if t >= 80 => "A"
    case t if t >= 60 => "B"
    case t if t >= 35 => "C"
    case _            => "D"
  })
df.withColumn("category", udfScoreToCategory(df("score"))).show(10)
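
For comparison, tying back to the note above that plain string/column functions often suffice, here is the same categorization expressed with the built-in when/otherwise functions instead of a UDF; built-in expressions stay visible to the Catalyst optimizer, whereas a UDF is a black box to it:

import org.apache.spark.sql.functions.when

df.withColumn("category",
  when(df("score") >= 80, "A")
    .when(df("score") >= 60, "B")
    .when(df("score") >= 35, "C")
    .otherwise("D")
).show(10)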
Ram Ghadiyaram
  • Note: please consider @MaxNevermind's thought as well; you have a simple SQL with string functions, so do a small POC to know the performance benefits before proceeding with the Spark SQL or HiveQL approach. – Ram Ghadiyaram Oct 12 '16 at 17:31

Just try to use it as it is; you should benefit right away if you were previously running this query with Hive on MapReduce. From there, if you still need better results, you can analyze the query plan and optimize it further, for example by using partitioning. Spark uses memory more heavily and, beyond simple transformations, is generally faster than MapReduce; Spark SQL also uses the Catalyst optimizer, so your query benefits from that too.
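
As a minimal sketch (assuming a Spark 1.x build with Hive support), running the query unchanged through a HiveContext might look like the following; the query body is elided here, and SYSDATE_ORA() would need either its Hive UDF jar on the classpath or a swap to a built-in such as current_timestamp():

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

val sc = new SparkContext(new SparkConf().setAppName("hive-query-as-is"))
val hiveContext = new HiveContext(sc)

// paste the original HiveQL here, unchanged apart from SYSDATE_ORA()
val result = hiveContext.sql(
  """SELECT ...   -- columns from the original query
    |FROM DAILY_TRAN_DETAIL_TABLE2 DTD
    |JOIN ods_gam G ON DTD.ACID = G.ACID
    |WHERE ...    -- predicates from the original query
    |LIMIT 50""".stripMargin)

result.show(50)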

Considering your comment about "using spark functions like Map, Filter etc": map() just transforms data, and since you only have string functions, I don't think you will gain anything by rewriting them with .map(...); Spark will do those transformations for you. As for filter(), if you can filter the input data, you can just rewrite the query using subqueries and other SQL capabilities.

MaxNevermind