I am new to Spark/Scala. I am trying to use Spark in the following scenario:
- an input transaction table
- reference (lookup) tables
All tables are stored in HBase and accessed from Spark via the Phoenix JDBC driver.
The lookup tables fall into two groups: a primary table and the others. The primary table determines which additional lookup tables are needed for processing the input transaction table.
I create a DataFrame on the primary lookup table, then identify the remaining lookup tables that need to be loaded, as shown below.
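Loading the primary table looks roughly like this (the JDBC URL, driver property, and table name are placeholders standing in for my actual values):

import java.util.Properties
import org.apache.spark.sql.DataFrame

// Phoenix JDBC connection details (placeholders)
val properties = new Properties()
properties.setProperty("driver", "org.apache.phoenix.jdbc.PhoenixDriver")

// Load the primary lookup table and expose it to Spark SQL as a temp table
val primaryDF: DataFrame = sqlContext.read.jdbc("myjdbcurl", "PRIMARY_LOOKUP", properties)
primaryDF.registerTempTable("primaryLookUp")

From primaryLookUp I then build a multimap of lookup table names to the columns needed from each: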
import scala.collection.mutable.{HashMap, MultiMap, Set}

val tableColumns = new HashMap[String, Set[String]] with MultiMap[String, String]
sqlContext.sql("select distinct LOOKUP_TABLE, LOOKUP_COLUMN from primaryLookUp")
  .collect()
  .foreach { x => tableColumns.addBinding(x.getString(0), x.getString(1)) }
So tableColumns looks something like [LookUpTable1 -> [Col1, Col2], LookUpTable4 -> [Col11]].
Next I iterate over this map and try to create DataFrames for the remaining lookup tables, storing them in a mutable HashMap. Here is the code:
val lookupDFs = new HashMap[String, DataFrame]
var df: DataFrame = null
tableColumns.foreach { keyVal =>
  // Load this lookup table via Phoenix JDBC and register it as a temp table
  sqlContext.read.jdbc("myjdbcurl", keyVal._1, properties).registerTempTable(keyVal._1)
  keyVal._2.foreach { value =>
    df = sqlContext.sql("select distinct " + value + " from " + keyVal._1)
    df.show
    lookupDFs.put(keyVal._1 + "##" + value, df)
  }
}
But immediately after this loop, if I try to access the DataFrames in lookupDFs, I get a NullPointerException.
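The access that fails looks roughly like this (simplified from my actual code):

lookupDFs.foreach { case (key, lookupDF) =>
  lookupDF.show  // NullPointerException thrown here
}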
However, if I alter the collection to store String literals instead of DataFrames, I face no issue at all. What am I missing here?
Let me know if any additional information is required. I am using Spark 1.6.2, Phoenix 4.4.0, and HBase 1.1.x.