
I am new to Spark/Scala. I am trying to use Spark in the following scenario -

  • an input transaction table
  • a set of reference or lookup tables

All tables are stored in HBase and accessed in Spark via the Phoenix JDBC driver.
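For context, here is roughly how I load a table over the Phoenix JDBC driver (a minimal sketch; the URL, driver property, and table name are placeholders for my actual setup, and sqlContext is the usual Spark 1.6 SQLContext):

    import java.util.Properties

    // placeholder values - the real Phoenix JDBC URL and table name differ per setup
    val url = "jdbc:phoenix:zookeeper-host:2181"
    val properties = new Properties()
    properties.setProperty("driver", "org.apache.phoenix.jdbc.PhoenixDriver")

    // load the table as a DataFrame and register it for Spark SQL queries
    val primaryLookUpDF = sqlContext.read.jdbc(url, "PRIMARY_LOOKUP", properties)
    primaryLookUpDF.registerTempTable("primaryLookUp")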

The lookup tables can be grouped into a primary table and the others; the primary table decides which additional lookup tables are needed for processing the input transaction table.

I create a DataFrame on the primary lookup table and identify the remaining lookup tables that need to be loaded, as shown below -

    import scala.collection.mutable.{HashMap, MultiMap, Set}

    val tableColumns = new HashMap[String, Set[String]] with MultiMap[String, String]

    sqlContext.sql("select distinct LOOKUP_TABLE, LOOKUP_COLUMN from primaryLookUp")
      .collect()
      .foreach { x => tableColumns.addBinding(x.getString(0), x.getString(1)) }

So tableColumns looks something like [LookUpTable1 -> [Col1, Col2], LookUpTable4 -> [Col11]].

Next I iterate over this map and try to create DataFrames for the remaining lookup tables, storing them in a mutable HashMap; here is the code -

    import scala.collection.mutable.HashMap
    import org.apache.spark.sql.DataFrame

    val lookupDFs = new HashMap[String, DataFrame]

    var df: DataFrame = null
    tableColumns.foreach { keyVal =>
      // register each lookup table as a temp table so it can be queried with Spark SQL
      sqlContext.read.jdbc("myjdbcurl", keyVal._1, properties).registerTempTable(keyVal._1)
      keyVal._2.foreach { value =>
        df = sqlContext.sql("select distinct " + value + " from " + keyVal._1)
        df.show
        lookupDFs.put(keyVal._1 + "##" + value, df)
      }
    }

But immediately after this loop, if I try to access the DataFrames from lookupDFs, it throws a NullPointerException.
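For example, an access like the following fails (the key is hypothetical, but it is built the same way as inside the loop):

    // hypothetical key, constructed as tableName + "##" + columnName
    lookupDFs("LookUpTable1##Col1").show  // NullPointerException here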

However, if I alter the collection to store String literals instead of DataFrames, I face no issue at all. What am I missing here?
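For comparison, this is roughly the String-based variant that works (a minimal sketch; lookupQueries is a name I use here only for illustration, storing the query text instead of the DataFrame):

    import scala.collection.mutable.HashMap

    // storing the query string instead of the DataFrame causes no problem
    val lookupQueries = new HashMap[String, String]
    tableColumns.foreach { keyVal =>
      keyVal._2.foreach { value =>
        lookupQueries.put(keyVal._1 + "##" + value,
          "select distinct " + value + " from " + keyVal._1)
      }
    }
    // accessing the map after the loop works fine in this version
    lookupQueries.foreach { case (k, q) => println(k + " -> " + q) }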

Let me know if any additional information is required. I am using Spark 1.6.2, Phoenix 4.4.0, and HBase 1.1.x.

mbaxi
  • You probably need to use broadcast variables; see a minimal example [here](http://stackoverflow.com/questions/40673773/how-to-use-a-broadcast-collection-in-a-udf/) – mtoto Dec 05 '16 at 12:39
  • The problem is that the tableColumns content may change with every run, so I have to dynamically generate DataFrames for n tables, which requires iterating. Outside the iteration loop the DataFrames become null; is there any other way to avoid the loop, populate all the DataFrames, and then broadcast them? – mbaxi Dec 05 '16 at 13:54
  • I wouldn't recommend holding DataFrames this way, but if you want to, try using df.cache() before df.show – Abhishek Anand Dec 05 '16 at 17:45
  • @AbhishekAnand - I tried using cache as well, but as soon as I am out of the loop the collection becomes null. I created a global variable instead and it seems to work, but I still need to test it thoroughly – mbaxi Dec 06 '16 at 07:41

0 Answers