I am using XGBoost for my model and have noticed that the H2O cluster does not share memory while the model is being processed. RAM utilization on the master A server is very high, while RAM utilization on master B is very low. I checked the H2O logs on both servers: the master A log file is updated continuously during model processing, but the master B log file is not. It shows only the cluster-creation entries.
Sometimes the H2O jar on master A goes down during model processing because of high memory usage.
I'm using H2O version 3.36.1.1 and have created a two-node cluster. The cluster was created successfully and the cluster details were written to the log file.
I have checked connectivity between master A and master B and ran curl from both sides. Everything works and the cluster reports as healthy.
- H2O_cluster_uptime: 15 mins 14 secs
- H2O_cluster_timezone: Asia/Colombo
- H2O_data_parsing_timezone: UTC
- H2O_cluster_version: 3.36.1.1
- H2O_cluster_version_age: 11 months and 28 days !!!
- H2O_cluster_name: XXXXXX
- H2O_cluster_total_nodes: 2
- H2O_cluster_free_memory: 43.36 Gb
- H2O_cluster_total_cores: 30
- H2O_cluster_allowed_cores: 30
- H2O_cluster_status: locked, healthy
- H2O_connection_url: http://localhost:54321
- H2O_connection_proxy: {"http": null, "https": null}
- H2O_internal_security: False
- Python_version: 3.7.11 final
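The listing above only gives cluster-wide totals, so it does not show how memory is split between master A and master B. A minimal sketch for a per-node view follows. It assumes the H2O REST endpoint `GET /3/Cloud` returns a JSON object with a `nodes` list whose entries carry `ip_port`, `free_mem` and `max_mem` (in bytes). Those field names are my assumption about the CloudV3 schema and should be checked against the actual response. The helper names `fetch_cloud` and `summarize_cloud` are hypothetical.

```python
import json
from urllib.request import urlopen

GIB = 1024 ** 3  # bytes per GiB


def summarize_cloud(cloud):
    """Return one (node, free_gib, max_gib) tuple per node.

    `cloud` is the decoded JSON body of GET /3/Cloud. The keys used here
    (nodes, ip_port, free_mem, max_mem) are assumed, not confirmed.
    """
    rows = []
    for node in cloud.get("nodes", []):
        rows.append((
            node.get("ip_port", "?"),
            round(node.get("free_mem", 0) / GIB, 2),
            round(node.get("max_mem", 0) / GIB, 2),
        ))
    return rows


def fetch_cloud(base_url, timeout=10):
    """GET <base_url>/3/Cloud and decode the JSON response."""
    with urlopen(f"{base_url}/3/Cloud", timeout=timeout) as resp:
        return json.load(resp)
```

Calling `summarize_cloud(fetch_cloud("http://localhost:54321"))` before and during training would show whether free memory drops on one node only.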
Could anyone please help me troubleshoot these issues?
Why do the two servers not share resources during model processing?
Why is the master B H2O log not updated?
Why does the H2O jar on master A go down under high memory usage?
Master A log
main INFO water.default: Open H2O Flow in your web browser: http://xxx.xxx.xxx.xx:54321
main INFO water.default:
FJ-126-15 INFO water.default: Cloud of size 2 formed [master01.user.com/xxx.xxx.xxx.xx:54321, master02.user.com/xxx.xxx.xxx.xx:54321]
058452-166 INFO water.default: GET /3/Metadata/schemas/CloudV3, parms: {}
058452-166 INFO water.default: Locking cloud to new members, because water.api.schemas3.MetadataV3
4058452-14 INFO water.default: GET /3/Metadata/schemas/H2OErrorV3, parms: {}
4058452-15 INFO water.default: GET /3/Metadata/schemas/H2OModelBuilderErrorV3, parms: {}
4058452-18 INFO water.default: POST /4/sessions, parms: {}
4058452-16 INFO water.default: POST /99/Rapids, parms: {ast=(setTimeZone "UTC"), session_id=_sid_a391}
4058452-13 INFO water.default: DELETE /3/DKV, parms: {}
4058452-13 INFO water.default: Removing all objects
4058452-13 INFO water.default: Finished removing objects
4058452-12 INFO water.default: DELETE /3/DKV, parms: {}
4058452-12 INFO water.default: Removing all objects
4058452-12 INFO water.default: Finished removing objects
058452-170 INFO water.default: DELETE /3/DKV, parms: {}
058452-170 INFO water.default: Removing all objects
058452-170 INFO water.default: Finished removing objects
4058452-14 INFO water.default: GET /3/Metadata/schemas/CloudV3, parms: {}
058452-169 INFO water.default: GET /3/Metadata/schemas/H2OErrorV3, parms: {}
058452-166 INFO water.default: GET /3/Metadata/schemas/H2OModelBuilderErrorV3, parms: {}
4058452-19 INFO water.default: POST /4/sessions, parms: {}
4058452-18 INFO water.default: POST /99/Rapids, parms: {ast=(setTimeZone "UTC"), session_id=_sid_bfac}
058452-170 INFO water.default: Reading byte InputStream into Frame:
058452-170 INFO water.default: frameKey: upload_bbcd4f6aeb3c1095e63f66a89cdd4756
058452-170 INFO water.default: totalChunks: 2
058452-170 INFO water.default: totalBytes: 4404663
058452-170 INFO water.default: Success.
058452-167 INFO water.default: POST /3/ParseSetup, parms: {single_quotes=False, source_frames=["upload_bbcd4f6aeb3c1095e63f66a89cdd4756"], check_header=0}
058452-169 INFO water.default: Total file size: 4.2 MB
058452-169 INFO water.default: Parse chunk size 4194304
FJ-1-15 INFO water.default: Parse result for Key_Frame__upload_bbcd4f6aeb3c1095e63f66a89cdd4756.hex (2023 rows, 436 columns):
FJ-1-15 INFO water.default: ColV2 type min max mean sigma NAs constant cardinality
FJ-1-15 INFO water.default: COL1: factor 011022232 YA9854024 1334
FJ-1-15 INFO water.default: COL2: numeric 2019.00 2020.00 2019.70 0.457960
FJ-1-15 INFO water.default: COL3: numeric 1.00000 12.0000 6.07860 2.82287
FJ-1-15 INFO water.default: COL4: factor |00011000813 |09988000074 1334
FJ-1-15 INFO water.default: COL5: factor CUST NAME CUSTOMER 2
FJ-1-15 INFO water.default: COL6: numeric 1.14005e+08 4.10024e+08 2.96146e+08 4.57328e+07
FJ-1-15 INFO water.default: COL7: numeric 10000.0 30000.0 28294.6 5573.93 3
FJ-1-15 INFO water.default: COL8: factor USD 4
FJ-1-15 INFO water.default: COL9: factor 927 RM17 20
FJ-1-15 INFO water.default: COL10: factor NO YES 2
FJ-1-15 INFO water.default: Additional column information only sent to log file...
FJ-1-15 INFO water.default: COL11: numeric -1.00000 175.250 1.07602 5.07740
FJ-1-15 INFO water.default: COL12: numeric -1.00000 97.2262 0.447662 3.19167
FJ-1-15 INFO water.default: COL13: numeric -1.00000 124.206 1.03933 3.94221
FJ-1-15 INFO water.default: response_class: factor 1A to_be_filled 5
FJ-1-15 INFO water.default: response_class_5: factor 1B 1B1 2
FJ-1-15 INFO water.default: response_class_4: factor 1A NON_PERFORME 4
FJ-1-15 INFO water.default: response_class_3: factor 1A NON_PERFORME 4
FJ-1-15 INFO water.default: response_class_2: factor 1A NON_PERFORME 4
FJ-1-15 INFO water.default: response_class_1: factor 1A NON_PERFORME 4
FJ-1-15 INFO water.default: subset: factor test train 2
FJ-1-15 INFO water.default: Chunk compression summary:
FJ-1-15 INFO water.default: Chunk Type Chunk Name Count Count Percentage Size Size Percentage
FJ-1-15 INFO water.default: C0L Constant long 74 8.486 % 5.8 KB 0.207 %
FJ-1-15 INFO water.default: CBS Binary 19 2.179 % 4.4 KB 0.159 %
FJ-1-15 INFO water.default: CXI Sparse Integers 80 9.174 % 25.0 KB 0.897 %
FJ-1-15 INFO water.default: CXF Sparse Reals 50 5.734 % 48.9 KB 1.753 %
FJ-1-15 INFO water.default: C1 1-Byte Integers 7 0.803 % 11.8 KB 0.423 %
FJ-1-15 INFO water.default: C1N 1-Byte Integers (w/o NAs) 92 10.550 % 104.0 KB 3.731 %
FJ-1-15 INFO water.default: C1S 1-Byte Fractions 142 16.284 % 118.4 KB 4.245 %
FJ-1-15 INFO water.default: C2 2-Byte Integers 72 8.257 % 231.7 KB 8.309 %
FJ-1-15 INFO water.default: C2S 2-Byte Fractions 18 2.064 % 22.9 KB 0.822 %
FJ-1-15 INFO water.default: C4 4-Byte Integers 50 5.734 % 109.1 KB 3.913 %
FJ-1-15 INFO water.default: C4S 4-Byte Fractions 127 14.564 % 360.5 KB 12.925 %
FJ-1-15 INFO water.default: C8 8-byte Integers 1 0.115 % 15.0 KB 0.539 %
FJ-1-15 INFO water.default: CUD Unique Reals 5 0.573 % 13.2 KB 0.472 %
FJ-1-15 INFO water.default: C8D 64-bit Reals 135 15.482 % 1.7 MB 61.606 %
FJ-1-15 INFO water.default: Frame distribution summary:
FJ-1-15 INFO water.default: Size Number of Rows Number of Chunks per Column Number of Chunks
Master B log
main INFO water.default: H2O started in 4906ms
main INFO water.default:
main INFO water.default: Open H2O Flow in your web browser: http://xxx.xxx.xxx.xx:54321
main INFO water.default:
FJ-126-15 INFO water.default: Cloud of size 2 formed [master01.user.com/xxx.xxx.xxx.xx:54321, master02.user.com/xxx.xxx.xxx.xx:54321]
FJ-123-15 INFO water.default: Locking cloud to new members, because Class Id=56
FJ-2-15 INFO water.default: Key upload_bbcd4f6aeb3c1095e63f66a89cdd4756 will be parsed using method DistributedParse.
FJ-2-21 INFO water.default: Key upload_902bcdd31a4aea9f65690f1bc6074886 will be parsed using method DistributedParse.
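For reference, the size figures in the Master A log can be cross-checked with a little arithmetic. The sketch below uses only values copied from that log (`totalBytes: 4404663`, `Parse chunk size 4194304`). That H2O derives `totalChunks` by a ceiling division like this is my assumption. The calculation only shows that the logged numbers are consistent with it. The function name `raw_chunk_count` is hypothetical.

```python
import math

total_bytes = 4404663       # "totalBytes" from the Master A log
parse_chunk_size = 4194304  # "Parse chunk size" from the Master A log


def raw_chunk_count(n_bytes, chunk_size):
    """Number of fixed-size chunks needed to hold n_bytes (ceiling division)."""
    return math.ceil(n_bytes / chunk_size)


chunks = raw_chunk_count(total_bytes, parse_chunk_size)
size_mib = round(total_bytes / (1024 * 1024), 1)

print(chunks)    # matches "totalChunks: 2" in the log
print(size_mib)  # matches "Total file size: 4.2 MB" in the log
```

So the uploaded file is about 4.2 MB and fits in two raw chunks of 4 MiB each. I am not sure whether such a small input affects how the work is spread across the two nodes.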