I am experimenting with the spark.ml library and its pipelines capability. There seems to be a limitation in combining SQL (SchemaRDDs) with splits (e.g. for train and test sets):
- It is nice that spark.ml works off of SchemaRDDs, but there is no easy way to randomly split a SchemaRDD into train and test sets. I can use randomSplit(Array(0.6, 0.4)), but that gives back an array of plain RDDs that lose the schema. I could force a case class on the rows and convert them back to SchemaRDDs, but I have a lot of features. As a workaround I used filter with a basic partitioning condition based on one of my iid features (see the sketch below). Any suggestions of what else can be done?
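One alternative I have been sketching (not sure it is idiomatic) is to re-attach the original schema after the split instead of writing a case class. This assumes Spark 1.2-era APIs and a hypothetical registered table "points"; since randomSplit is inherited from RDD[Row], the pieces come back schemaless, but SQLContext.applySchema (createDataFrame in 1.3+) can put the schema back:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

// Hypothetical setup: a registered table "points" with label/feature columns.
val sc = new SparkContext(new SparkConf().setAppName("split-sketch"))
val sqlContext = new SQLContext(sc)
val data = sqlContext.sql("SELECT * FROM points")           // SchemaRDD

// randomSplit comes from RDD[Row], so the splits are plain RDDs; re-apply the
// original schema rather than going through a case class.
val Array(trainRows, testRows) = data.randomSplit(Array(0.6, 0.4), seed = 42L)
val train = sqlContext.applySchema(trainRows, data.schema)  // createDataFrame in 1.3+
val test  = sqlContext.applySchema(testRows, data.schema)
```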
Regarding the generated model:
- How do I access the model weights? The lr optimizer and the lr model hold weights internally, but it is unclear how to use them (my best guess is sketched below).
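The best I have come up with so far is something like the following. It assumes a Spark release where PipelineModel exposes its fitted stages publicly and the fitted ml LogisticRegressionModel exposes public weights and intercept (renamed coefficients in 2.x); I am not certain this holds for the release I am on, and the `train` SchemaRDD/DataFrame with "label" and "features" columns is hypothetical:

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.{LogisticRegression, LogisticRegressionModel}

// Fit a one-stage pipeline on a hypothetical `train` set.
val lr = new LogisticRegression().setMaxIter(10).setRegParam(0.01)
val pipelineModel = new Pipeline().setStages(Array(lr)).fit(train)

// Pull the fitted stage back out of the PipelineModel to get at its weights.
val lrModel = pipelineModel.stages.last.asInstanceOf[LogisticRegressionModel]
println(lrModel.weights)    // per-feature coefficient vector
println(lrModel.intercept)  // intercept term
```

Is extracting the fitted stage like this the intended way, or is there a supported API for inspecting the learned parameters?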