How to use NamedDataFrame from spark job server

Question

I use Spark Job Server (SJS) in my project and would like to know how its NamedDataFrame works. My first program does this:

val schemaString = "parm1:int,parm2:string,parm3:string,parm4:string,parm5:int,parm6:string,parm7:int,parm8:int"
val schema = StructType(schemaString.split(",").map { field =>
  // Each entry has the form "name:type"
  val Array(name, typeName) = field.split(":")
  StructField(name, getFieldTypeInSchema(typeName), nullable = true)
})

val eDF1 = hive.applySchema(rowRDD1, schema)
this.namedObjects.getOrElseCreate("eDF1", new NamedDataFrame(eDF1, true, StorageLevel.MEMORY_ONLY))

My second program does this to retrieve the DataFrame.

val eDF1: Option[NamedDataFrame] = this.namedObjects.get("eDF1")

Here I am only able to get an Option. How do I get from the NamedDataFrame to a Spark DataFrame?

Is something equivalent to this available?

this.namedObjects.get[(Int,String,String,String,Int,String,Int,Int)]("eDF1")

Thanks!!

Edit 1: To be precise, without SJS persistence, I could do this directly on the DataFrame:

eDF1.filter(eDF1.col("parm1")%2!==0)

How can I perform the same operation from a saved namedObject?

score 0 · Answer 1 · answered Oct 04 '16 at 04:08


Take a look at https://github.com/spark-jobserver/spark-jobserver/blob/master/job-server-extras/src/spark.jobserver/NamedObjectsTestJob.scala for an example


noorul


The above example does not say how to retrieve a DataFrame. Here are some lines from your example. Can you say how to retrieve df1 while keeping the StructType?

val struct = StructType(
  StructField("i", IntegerType, true) ::
  StructField("b", BooleanType, false) :: Nil)
val df = sqlContext.createDataFrame(rows(sc), struct)
namedObjects.update("df1", NamedDataFrame(df, true, StorageLevel.MEMORY_AND_DISK))

– user1384205 Oct 04 '16 at 09:27

score 0 · Answer 2 · answered Oct 04 '16 at 14:32

The following works on NamedDataFrame

Job1

this.namedObjects.getOrElseCreate("df:esDF1", new NamedDataFrame(eDF1, true, StorageLevel.MEMORY_ONLY))

Job2

val NamedDataFrame(eDF1, _, _) = namedObjects.get[NamedDataFrame]("df:esDF1").get

Now I can operate on eDF1 in the second job as a Spark DataFrame.
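
One caveat with the Job2 line: calling .get on the Option throws a NoSuchElementException if nothing is registered under that name (for example, if Job1 has not run in this context, or the key differs in case). A minimal sketch of a safer unwrap is below. It assumes NamedDataFrame is the case class from job-server-extras in package spark.jobserver, with the DataFrame as its first field, as the pattern match in Job2 already relies on. The object name NamedDataFrameHelpers and the method dataFrameOrFail are hypothetical names made up for this illustration, not part of SJS.

```scala
import org.apache.spark.sql.DataFrame
import spark.jobserver.NamedDataFrame

// Hypothetical helper, not part of Spark Job Server.
object NamedDataFrameHelpers {

  // Unwraps the Option returned by namedObjects.get without calling .get,
  // so a missing name fails with a descriptive message.
  def dataFrameOrFail(name: String, found: Option[NamedDataFrame]): DataFrame =
    found match {
      // The case-class extractor exposes the wrapped DataFrame; the
      // forceComputation flag and storage level are ignored here.
      case Some(NamedDataFrame(df, _, _)) => df
      case None =>
        throw new IllegalStateException(
          s"No NamedDataFrame registered under '$name'")
    }
}

// Intended use inside Job2 (assumes the job mixes in NamedObjectSupport):
//   val eDF1 = NamedDataFrameHelpers.dataFrameOrFail(
//     "df:esDF1", namedObjects.get[NamedDataFrame]("df:esDF1"))
//   eDF1.filter(eDF1.col("parm1") % 2 !== 0)
```

The DataFrame that comes back carries its schema, so the filter from the question's edit should work on it unchanged.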
