
H2O (Sparkling Water) gives different predictions in Spark cluster mode than in Spark local mode, and the local-mode predictions are better. Why does this happen? Is this expected H2O behaviour? Two data sets are used: one for training the model and another for scoring.
trainingData.csv: 1.8 MB (2211 rows)
testingData.csv: 1.8 MB (2211 rows)
Driver memory: 1G
Executor memory: 1G
Number of executors: 1
The following command is used on the cluster:
nohup /usr/hdp/current/spark2-client/bin/spark-submit --class com.inn.sparkrunner.h2o.GradientBoostingAlgorithm --master yarn --driver-memory 1G --executor-memory 1G --num-executors 1 --deploy-mode cluster spark-runner-1.0.jar > tool.log &

1) Main method

    public static void main(String[] args) {
        SparkSession sparkSession = getSparkSession();
        H2OContext h2oContext = getH2oContext(sparkSession);
        UnseenDataTestDRF(sparkSession, h2oContext);
    }

2) The H2O context is created.

    private static H2OContext getH2oContext(SparkSession sparkSession) {
        H2OConf h2oConf = new H2OConf(sparkSession.sparkContext()).setInternalClusterMode();
        H2OContext orCreate = H2OContext.getOrCreate(sparkSession.sparkContext(), h2oConf);
        return orCreate;
    }

3) The Spark session is created.

    public static SparkSession getSparkSession() {
        // Note: the master is hard-coded to "yarn" here.
        SparkSession spark = SparkSession.builder().appName("Java Spark SQL basic example").master("yarn")
                .getOrCreate();
        return spark;
    }

4) The GBM parameters are set.

    private static GBMParameters getGBMParam(H2OFrame asH2OFrame) {
        GBMParameters gbmParam = new GBMParameters();
        gbmParam._response_column = "high";
        gbmParam._train = asH2OFrame._key;
        gbmParam._ntrees = 10;
        gbmParam._seed = 1;
        return gbmParam;
    }
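To make "different predictions" concrete, the scores from the two runs can be compared row by row. The following is only a minimal sketch in plain Java, assuming the predictions of each run have already been exported into two `double[]` arrays in the same row order. The class `PredictionDiff`, its `compare` method, and the sample values are hypothetical and are not part of H2O or Sparkling Water.

```java
/** Hypothetical helper: quantifies how far two prediction vectors are apart. */
final class PredictionDiff {

    /** Summary statistics of the row-wise absolute differences. */
    static final class Summary {
        final double maxAbsDiff;
        final double meanAbsDiff;
        final int rowsAboveTolerance;

        Summary(double maxAbsDiff, double meanAbsDiff, int rowsAboveTolerance) {
            this.maxAbsDiff = maxAbsDiff;
            this.meanAbsDiff = meanAbsDiff;
            this.rowsAboveTolerance = rowsAboveTolerance;
        }
    }

    /**
     * Compares predictions from the local-mode run with those from the
     * cluster-mode run. Both arrays must be in the same row order.
     */
    static Summary compare(double[] local, double[] cluster, double tolerance) {
        if (local.length != cluster.length) {
            throw new IllegalArgumentException("Prediction vectors differ in length");
        }
        double max = 0.0;
        double sum = 0.0;
        int above = 0;
        for (int i = 0; i < local.length; i++) {
            double diff = Math.abs(local[i] - cluster[i]);
            max = Math.max(max, diff);
            sum += diff;
            if (diff > tolerance) {
                above++;
            }
        }
        double mean = local.length == 0 ? 0.0 : sum / local.length;
        return new Summary(max, mean, above);
    }

    public static void main(String[] args) {
        // Made-up values for illustration only, not real model output.
        double[] local = {0.10, 0.80, 0.35, 0.92};
        double[] cluster = {0.10, 0.75, 0.35, 0.95};
        Summary s = compare(local, cluster, 1e-6);
        System.out.printf("max=%.4f mean=%.4f rowsAboveTolerance=%d%n",
                s.maxAbsDiff, s.meanAbsDiff, s.rowsAboveTolerance);
    }
}
```

Running the same comparison on two repeated runs of the same mode would show whether the gap between local and cluster mode is consistent or just run-to-run variation.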
poojanavin
  • Did you set the same seed in both cases? – Erin LeDell Mar 30 '18 at 19:28
  • H2O is always running in cluster mode, even if just one machine, e.g. localhost, is in the cluster. Can you give more information about the configuration (memory, number of machines, number of cores in each machine) of the two clusters you are comparing, and also how big your data is? Also what is the metric score for the two? Does it vary randomly run to run, or is one cluster consistently always better than the other? (That ties in to Erin's question about setting a seed.) – Darren Cook Mar 31 '18 at 09:02
  • @ErinLeDell Whether or not the seed parameter is set in the GBM algorithm (Sparkling Water), Spark cluster mode gives different predictions from Spark local mode. Why is this happening? Can you help me? My two previous comments on this problem show how I am running the code. – poojanavin Apr 03 '18 at 11:27
  • @poojanavin Can you edit your question to include the content of your comments - it will be easier to read, so you are more likely to get an answer. I'd also include the H2O versions of the two setups (if they are different that could be an explanation). – Darren Cook Apr 04 '18 at 07:19
  • @DarrenCook As you suggested, I have edited my question. – poojanavin Apr 04 '18 at 12:02

0 Answers