Could you please guide me on how to create and execute machine learning/statistical models (regression, decision trees, k-means clustering, Naive Bayes, scorecards, linear/logistic regression, GBM, GLM, etc.) in a Java/JVM-based application in production?
We have an ETL-style, Java-based product that handles most of the data preparation steps for machine learning: data ingestion from JDBC, files, HDFS, NoSQL, etc., plus the joins and aggregations required for feature engineering. We now want to add analytics capabilities using machine learning/statistical modeling.
Right now we use the JPMML evaluator to score models that were built in R, Python, or KNIME and exported in PMML format, but this requires three separate, unconnected steps:
1- Prepare the data in our Java/JVM application and save the sample data (training and test sets) to a CSV file or a database.
2- Create a machine learning model in R, Python, or KNIME and export it in PMML 4.2 format.
3- Import/deploy the PMML file in our Java-based application and use the JPMML evaluator to execute it in production.
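To make step 1 concrete, here is a rough sketch of a seeded train/test split written out as CSV files, using only the Java standard library. The class name `TrainTestSplit`, the method names, and the assumption that each row is already a CSV-formatted string are all hypothetical and only for illustration; a real pipeline would likely stream rows rather than hold them in memory.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

// Hypothetical helper: not part of any existing product or library.
class TrainTestSplit {

    // Shuffles the rows with a fixed seed (so the split is reproducible)
    // and returns a two-element list: [trainingRows, testRows].
    static List<List<String>> split(List<String> rows, double trainFraction, long seed) {
        List<String> shuffled = new ArrayList<>(rows);
        Collections.shuffle(shuffled, new Random(seed));
        int cut = (int) Math.round(shuffled.size() * trainFraction);
        List<String> train = new ArrayList<>(shuffled.subList(0, cut));
        List<String> test = new ArrayList<>(shuffled.subList(cut, shuffled.size()));
        return List.of(train, test);
    }

    // Writes a header line followed by the rows. Rows are assumed to be
    // already CSV-formatted (escaping is out of scope for this sketch).
    static void writeCsv(Path file, String header, List<String> rows) throws IOException {
        List<String> lines = new ArrayList<>();
        lines.add(header);
        lines.addAll(rows);
        Files.write(file, lines, StandardCharsets.UTF_8);
    }
}
```

The fixed seed matters here because the external R/Python step should see the same training and test sets every time the export is repeated.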
I am sure this is a common problem in machine learning, since Java is generally preferred over Python or R in production. Could you suggest a better approach for both creating and executing a Python/scikit-learn based machine learning model in a JVM-based application?
What are your thoughts on achieving steps 2 and 3 more seamlessly in a JVM-based application, without compromising performance and usability? The options I see are:
1- Call a Java program that internally runs the Python scikit-learn script (under the hood) to create a model in PMML, and then use the JPMML evaluator. To the user it would appear to be a single JVM-based application (better usability). I am not sure about the limitations and shortcomings of PMML here, as not all features are supported in jpmml-sklearn.
2- Call a Java program that internally runs the Python script and does both model creation and execution in an external Python environment, serializing the model and the results to a file/CSV or an in-memory DB (or a cache such as Hazelcast), from which the parent Java application fetches the results. From my research, I can't use Jython to execute scikit-learn models.
3- Can I use Jep (Embed Python in Java) to embed CPython in the JVM? Has anybody tried it with scikit-learn models?
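For illustration, options 1 and 2 both come down to the JVM launching an external Python process and then picking up a file it produced. A minimal sketch using `ProcessBuilder` might look like the following. The class name `PythonTrainer`, the script name, and the `--train`/`--out` flags are hypothetical conventions that the Python script would have to implement; nothing here is prescribed by JPMML or scikit-learn.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.concurrent.TimeUnit;

// Hypothetical wrapper around an external Python training script.
class PythonTrainer {

    // Builds the command line. The "--train" and "--out" flags are assumed
    // conventions of the (hypothetical) Python script, not a standard.
    static List<String> buildCommand(String pythonExe, Path script, Path trainingCsv, Path pmmlOut) {
        return List.of(
                pythonExe, script.toString(),
                "--train", trainingCsv.toString(),
                "--out", pmmlOut.toString());
    }

    // Runs the script, waits up to timeoutSeconds, and returns the path of
    // the PMML file once the script has exited successfully.
    static Path train(String pythonExe, Path script, Path trainingCsv, Path pmmlOut,
                      long timeoutSeconds) throws IOException, InterruptedException {
        ProcessBuilder pb = new ProcessBuilder(buildCommand(pythonExe, script, trainingCsv, pmmlOut));
        // Forward the script's stdout/stderr to the JVM's own streams so the
        // child cannot block on a full, unread pipe buffer.
        pb.inheritIO();
        Process process = pb.start();
        if (!process.waitFor(timeoutSeconds, TimeUnit.SECONDS)) {
            process.destroyForcibly();
            throw new IOException("Python training timed out after " + timeoutSeconds + " s");
        }
        if (process.exitValue() != 0) {
            throw new IOException("Python training failed with exit code " + process.exitValue());
        }
        if (!Files.isRegularFile(pmmlOut)) {
            throw new IOException("Python script did not produce " + pmmlOut);
        }
        return pmmlOut;
    }
}
```

A call such as `PythonTrainer.train("python", Paths.get("train.py"), Paths.get("train.csv"), Paths.get("model.pmml"), 600)` would then hand the resulting file to the existing JPMML evaluator step (option 1), or the same pattern could read back a results CSV instead of a PMML file (option 2). Since the executable name is passed in, the same code should work on Windows and non-Windows platforms, provided a Python environment with scikit-learn is installed on the host.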
Alternatively, should I explore Mahout or Weka, the Java-based machine learning libraries, for my JVM-based application? (I need to support both Windows and non-Windows platforms.)
I am also exploring H2O.ai, which is Java-based. Has anybody tried it?