I am using XGBoost for my model and have noticed that the H2O cluster does not share memory while the model is training. RAM utilization on the master A server is very high, while on master B it is very low. I checked the H2O logs on both servers: the master A log file is updated continuously during training, but the master B log file is not. It only shows the cluster-creation entries.

Sometimes, during model training, the H2O jar on master A goes down due to high memory usage.

I'm using H2O version 3.36.1.1 with a two-node cluster. The cluster was created successfully and the cluster details were written to the log file.

I have checked connectivity between master A and master B and ran curl from both sides. Everything works, and the cluster itself forms correctly.

  • H2O_cluster_uptime: 15 mins 14 secs
  • H2O_cluster_timezone: Asia/Colombo
  • H2O_data_parsing_timezone: UTC
  • H2O_cluster_version: 3.36.1.1
  • H2O_cluster_version_age: 11 months and 28 days !!!
  • H2O_cluster_name: XXXXXX
  • H2O_cluster_total_nodes: 2
  • H2O_cluster_free_memory: 43.36 Gb
  • H2O_cluster_total_cores: 30
  • H2O_cluster_allowed_cores: 30
  • H2O_cluster_status: locked, healthy
  • H2O_connection_url: http://localhost:54321
  • H2O_connection_proxy: {"http": null, "https": null}
  • H2O_internal_security: False
  • Python_version: 3.7.11 final
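
The summary above reports memory only for the cluster as a whole, so it cannot show whether one node is under more pressure than the other. A minimal sketch of a per-node check follows, assuming the node's REST API exposes `/3/Cloud` with a `nodes` list whose entries carry `h2o`, `healthy`, `free_mem`, and `max_mem` fields (in bytes). The field names and the helper names `fetch_cloud_status` and `summarize_nodes` are assumptions for illustration and should be checked against the installed version.

```python
import json
from urllib.request import urlopen


def fetch_cloud_status(base_url="http://localhost:54321"):
    """Fetch the cluster status JSON from a running H2O node."""
    with urlopen(f"{base_url}/3/Cloud") as response:
        return json.load(response)


def summarize_nodes(cloud):
    """Return one (name, healthy, free_gib, max_gib) tuple per node.

    Assumes each node entry reports memory in bytes under the
    'free_mem' and 'max_mem' keys (an assumption, not verified here).
    """
    gib = 1024 ** 3
    rows = []
    for node in cloud.get("nodes", []):
        rows.append((
            node.get("h2o"),
            node.get("healthy"),
            round(node.get("free_mem", 0) / gib, 2),
            round(node.get("max_mem", 0) / gib, 2),
        ))
    return rows


if __name__ == "__main__":
    # Requires a reachable H2O node at the default address.
    for name, healthy, free_gib, max_gib in summarize_nodes(fetch_cloud_status()):
        print(f"{name}: healthy={healthy}, free={free_gib} GiB, max={max_gib} GiB")
```

Running this a few times during training, against either node, would show whether free memory drops on only one of them.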

Could anyone please help me troubleshoot these issues?

Why do the two servers not share resources during model training?

Why is the master B H2O log not updated?

Why does the H2O jar on master A go down under high memory usage?

Master A log

            main  INFO water.default: Open H2O Flow in your web browser: http://xxx.xxx.xxx.xx:54321
        main  INFO water.default: 
   FJ-126-15  INFO water.default: Cloud of size 2 formed [master01.user.com/xxx.xxx.xxx.xx:54321, master02.user.com/xxx.xxx.xxx.xx:54321]
  058452-166  INFO water.default: GET /3/Metadata/schemas/CloudV3, parms: {}
  058452-166  INFO water.default: Locking cloud to new members, because water.api.schemas3.MetadataV3
  4058452-14  INFO water.default: GET /3/Metadata/schemas/H2OErrorV3, parms: {}
  4058452-15  INFO water.default: GET /3/Metadata/schemas/H2OModelBuilderErrorV3, parms: {}
  4058452-18  INFO water.default: POST /4/sessions, parms: {}
  4058452-16  INFO water.default: POST /99/Rapids, parms: {ast=(setTimeZone "UTC"), session_id=_sid_a391}
  4058452-13  INFO water.default: DELETE /3/DKV, parms: {}
  4058452-13  INFO water.default: Removing all objects
  4058452-13  INFO water.default: Finished removing objects
  4058452-12  INFO water.default: DELETE /3/DKV, parms: {}
  4058452-12  INFO water.default: Removing all objects
  4058452-12  INFO water.default: Finished removing objects
  058452-170  INFO water.default: DELETE /3/DKV, parms: {}
  058452-170  INFO water.default: Removing all objects
  058452-170  INFO water.default: Finished removing objects
  4058452-14  INFO water.default: GET /3/Metadata/schemas/CloudV3, parms: {}
  058452-169  INFO water.default: GET /3/Metadata/schemas/H2OErrorV3, parms: {}
  058452-166  INFO water.default: GET /3/Metadata/schemas/H2OModelBuilderErrorV3, parms: {}
  4058452-19  INFO water.default: POST /4/sessions, parms: {}
  4058452-18  INFO water.default: POST /99/Rapids, parms: {ast=(setTimeZone "UTC"), session_id=_sid_bfac}
  058452-170  INFO water.default: Reading byte InputStream into Frame:
  058452-170  INFO water.default:     frameKey:    upload_bbcd4f6aeb3c1095e63f66a89cdd4756
  058452-170  INFO water.default:     totalChunks: 2
  058452-170  INFO water.default:     totalBytes:  4404663
  058452-170  INFO water.default:     Success.
  058452-167  INFO water.default: POST /3/ParseSetup, parms: {single_quotes=False, source_frames=["upload_bbcd4f6aeb3c1095e63f66a89cdd4756"], check_header=0}
  058452-169  INFO water.default: Total file size: 4.2 MB
  058452-169  INFO water.default: Parse chunk size 4194304
     FJ-1-15  INFO water.default: Parse result for Key_Frame__upload_bbcd4f6aeb3c1095e63f66a89cdd4756.hex (2023 rows, 436 columns):
     FJ-1-15  INFO water.default:                               ColV2    type          min          max         mean        sigma         NAs constant cardinality
     FJ-1-15  INFO water.default:                                COL1:  factor    011022232    YA9854024                                                  1334
     FJ-1-15  INFO water.default:                      COL2: numeric      2019.00      2020.00      2019.70     0.457960                            
     FJ-1-15  INFO water.default:                     COL3: numeric      1.00000      12.0000      6.07860      2.82287                            
     FJ-1-15  INFO water.default:                         COL4:  factor |00011000813 |09988000074                                                  1334
     FJ-1-15  INFO water.default:                       COL5:  factor    CUST NAME     CUSTOMER                                                     2
     FJ-1-15  INFO water.default:                         COL6: numeric  1.14005e+08  4.10024e+08  2.96146e+08  4.57328e+07                            
     FJ-1-15  INFO water.default:                    COL7: numeric      10000.0      30000.0      28294.6      5573.93           3                
     FJ-1-15  INFO water.default:                     COL8:  factor                       USD                                                     4
     FJ-1-15  INFO water.default:                              COL9:  factor          927         RM17                                                    20
     FJ-1-15  INFO water.default:               COL10:  factor           NO          YES                                                     2
     FJ-1-15  INFO water.default: Additional column information only sent to log file...
     FJ-1-15  INFO water.default:                COL11: numeric     -1.00000      175.250      1.07602      5.07740                            
     FJ-1-15  INFO water.default:                COL12: numeric     -1.00000      97.2262     0.447662      3.19167                            
     FJ-1-15  INFO water.default:                COL13: numeric     -1.00000      124.206      1.03933      3.94221                            
     FJ-1-15  INFO water.default:                      response_class:  factor           1A to_be_filled                                                     5
     FJ-1-15  INFO water.default:                    response_class_5:  factor           1B          1B1                                                     2
     FJ-1-15  INFO water.default:                    response_class_4:  factor           1A NON_PERFORME                                                     4
     FJ-1-15  INFO water.default:                    response_class_3:  factor           1A NON_PERFORME                                                     4
     FJ-1-15  INFO water.default:                    response_class_2:  factor           1A NON_PERFORME                                                     4
     FJ-1-15  INFO water.default:                    response_class_1:  factor           1A NON_PERFORME                                                     4
     FJ-1-15  INFO water.default:                              subset:  factor         test        train                                                     2
     FJ-1-15  INFO water.default: Chunk compression summary:
     FJ-1-15  INFO water.default:   Chunk Type                 Chunk Name       Count  Count Percentage        Size  Size Percentage
     FJ-1-15  INFO water.default:          C0L              Constant long          74           8.486 %      5.8 KB          0.207 %
     FJ-1-15  INFO water.default:          CBS                     Binary          19           2.179 %      4.4 KB          0.159 %
     FJ-1-15  INFO water.default:          CXI            Sparse Integers          80           9.174 %     25.0 KB          0.897 %
     FJ-1-15  INFO water.default:          CXF               Sparse Reals          50           5.734 %     48.9 KB          1.753 %
     FJ-1-15  INFO water.default:           C1            1-Byte Integers           7           0.803 %     11.8 KB          0.423 %
     FJ-1-15  INFO water.default:          C1N  1-Byte Integers (w/o NAs)          92          10.550 %    104.0 KB          3.731 %
     FJ-1-15  INFO water.default:          C1S           1-Byte Fractions         142          16.284 %    118.4 KB          4.245 %
     FJ-1-15  INFO water.default:           C2            2-Byte Integers          72           8.257 %    231.7 KB          8.309 %
     FJ-1-15  INFO water.default:          C2S           2-Byte Fractions          18           2.064 %     22.9 KB          0.822 %
     FJ-1-15  INFO water.default:           C4            4-Byte Integers          50           5.734 %    109.1 KB          3.913 %
     FJ-1-15  INFO water.default:          C4S           4-Byte Fractions         127          14.564 %    360.5 KB         12.925 %
     FJ-1-15  INFO water.default:           C8            8-byte Integers           1           0.115 %     15.0 KB          0.539 %
     FJ-1-15  INFO water.default:          CUD               Unique Reals           5           0.573 %     13.2 KB          0.472 %
     FJ-1-15  INFO water.default:          C8D               64-bit Reals         135          15.482 %      1.7 MB         61.606 %
     FJ-1-15  INFO water.default: Frame distribution summary:
     FJ-1-15  INFO water.default:                             Size  Number of Rows  Number of Chunks per Column  Number of Chunks

Master B log

    main  INFO water.default: H2O started in 4906ms
     main  INFO water.default: 
     main  INFO water.default: Open H2O Flow in your web browser: http://xxx.xxx.xxx.xx:54321
     main  INFO water.default: 
FJ-126-15  INFO water.default: Cloud of size 2 formed [master01.user.com/xxx.xxx.xxx.xx:54321, master02.user.com/xxx.xxx.xxx.xx:54321]
FJ-123-15  INFO water.default: Locking cloud to new members, because Class Id=56
  FJ-2-15  INFO water.default: Key upload_bbcd4f6aeb3c1095e63f66a89cdd4756 will be parsed using method DistributedParse.
  FJ-2-21  INFO water.default: Key upload_902bcdd31a4aea9f65690f1bc6074886 will be parsed using method DistributedParse.

  • Can you share the logs from both H2O nodes? – Marek Novotny Apr 11 '23 at 16:13
  • @MarekNovotny I have shared the Master A and Master B logs: part of the Master A log and the full Master B log. – Nalinda Perera Apr 17 '23 at 07:15
  • It's strange to me that the parser creates just two chunks for the input dataset. Can you share the command that imports the dataset? Also, what additional parameters do you use for running the H2O nodes? Do you run an H2O node inside Docker? – Marek Novotny Apr 18 '23 at 09:45
  • I called h2o.init() with no additional parameters, and it is not running in Docker. I import the data with 'h2o.upload_file(path=files)' and use the following parameters for H2OGridSearch: param = {"min_rows": 5, "seed": -6, "score_tree_interval": 100, "stopping_metric": "deviance", "stopping_tolerance": 0.001, "max_bins": 256, "distribution": "tweedie"}, hyper_parameters = {'ntrees': [350], 'max_depth': [9], 'learn_rate': [0.15], 'sample_rate': [0.85], "min_split_improvement": [0.02], "reg_lambda": [1], "reg_alpha": [1]}, grid_search = H2OGridSearch(model=H2OXGBoostEstimator(**param), hyper_params=hyper_parameters) – Nalinda Perera Apr 19 '23 at 07:48
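
For readability, the configuration quoted in the last comment can be laid out as a short script. This is only a sketch of the setup as described there: the dictionaries and the `H2OGridSearch` call are copied from the comment, while the helper names `count_grid_models` and `build_grid` are hypothetical and added for illustration. The h2o imports are kept inside `build_grid` so the dictionaries can be inspected without a running cluster.

```python
# Fixed estimator parameters, as quoted in the comment.
param = {
    "min_rows": 5,
    "seed": -6,
    "score_tree_interval": 100,
    "stopping_metric": "deviance",
    "stopping_tolerance": 0.001,
    "max_bins": 256,
    "distribution": "tweedie",
}

# Grid values, as quoted in the comment (one value per hyperparameter).
hyper_parameters = {
    "ntrees": [350],
    "max_depth": [9],
    "learn_rate": [0.15],
    "sample_rate": [0.85],
    "min_split_improvement": [0.02],
    "reg_lambda": [1],
    "reg_alpha": [1],
}


def count_grid_models(hyper_params):
    """Number of combinations in a full Cartesian grid."""
    total = 1
    for values in hyper_params.values():
        total *= len(values)
    return total


def build_grid():
    """Build the grid search object as described in the comment."""
    from h2o.estimators import H2OXGBoostEstimator
    from h2o.grid.grid_search import H2OGridSearch

    return H2OGridSearch(
        model=H2OXGBoostEstimator(**param),
        hyper_params=hyper_parameters,
    )


if __name__ == "__main__":
    print(count_grid_models(hyper_parameters))  # prints 1
```

Because every hyperparameter list holds a single value, this grid contains exactly one combination, so the search trains a single XGBoost model with 350 trees of depth 9.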
