
I have been using the cross-validation process to train a Naive Bayes model, and I noticed that it uses the kFold method to randomly sample the data into folds. This method returns an Array[(RDD[T], RDD[T])] of tuples, which I believe are the different training and testing combinations of the folds.

My question is whether there is any specific reason why the API does not allow you to define your own array of folds. I need that functionality, and I am guessing that I will have to write my own CrossValidator class to support it. I am also open to any advice.
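To make the request concrete, here is a minimal sketch of the kind of fold construction I have in mind. It uses plain Scala collections rather than RDDs so that it runs without a Spark dependency. `customFolds`, `Record`, and the `group` field are hypothetical names made up for illustration and are not part of the Spark API. The assumption is that the same `filter`-style split could be applied to an RDD to produce the Array[(RDD[T], RDD[T])] shape that kFold returns.

```scala
// Hypothetical helper (not part of Spark): build (training, validation) pairs
// from a user-supplied fold assignment instead of random sampling.
def customFolds[T](data: Seq[T], numFolds: Int)(foldOf: T => Int): Array[(Seq[T], Seq[T])] = {
  require(numFolds > 1, "need at least two folds")
  (0 until numFolds).map { k =>
    // Elements assigned to fold k are held out; everything else is training data.
    val (validation, training) = data.partition(x => foldOf(x) == k)
    (training, validation)
  }.toArray
}

// Illustrative data: records tagged with a group id that must stay in one fold.
case class Record(group: Int, label: Double)

val records = Seq(
  Record(0, 1.0), Record(0, 0.0),
  Record(1, 1.0),
  Record(2, 0.0), Record(2, 1.0)
)

// Use the group id directly as the fold index.
val folds = customFolds(records, numFolds = 3)(_.group)

folds.zipWithIndex.foreach { case ((train, valid), k) =>
  println(s"fold $k: train=${train.size}, validation=${valid.size}")
}
```

The pairs are ordered (training, validation) to mirror how I understand the kFold result to be laid out. A CrossValidator that accepted such an array directly is the capability I am missing.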

dbustosp
  • I face the same problem. Maybe you should open an issue on Jira. – Borbag Jun 24 '16 at 10:06
  • Also, if you have implemented something that does the job, I am very interested! – Borbag Jun 24 '16 at 10:55
  • Hello @Borbag, I do have something implemented. It's not elegant, but it worked for that specific use case. – dbustosp Jun 24 '16 at 15:00
  • As long as it does the job, and it's faster to learn how to use it than to code it myself, I would be happy to use it! – Borbag Jun 24 '16 at 15:06
  • @Borbag are you interested in working together on this? I've been so busy these days. I can upload the code to github and you can go ahead and modify it. Does that work for you? – dbustosp Jul 04 '16 at 19:02
  • It's been a long time, so I did my own thing. Quick and dirty, though. Maybe we can work together to modify the Spark code so it accepts a custom folding function, and submit a PR? – Borbag Jul 04 '16 at 19:05
  • I requested to be added to the Spark user list so I could ask about that, but I have not received an answer yet. Check this out: https://issues.apache.org/jira/browse/SPARK-16206 – dbustosp Jul 04 '16 at 19:15

0 Answers