I simply want to tune my XGBoost model with Ray Tune locally. However, my "actor" always dies unexpectedly before the Ray trial can even start.
I have tried different code variants and also checked memory consumption/availability (I have 48 CPUs and 256 GB RAM, so there should be no memory bottleneck).
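For reference, here is a rough sketch of the kind of resource sanity check I mean (it assumes psutil is installed):

import os

import psutil  # assumption: psutil is available in the environment

# Confirm the machine actually exposes the expected resources before tuning.
print("logical CPUs      :", os.cpu_count())
print("total RAM (GB)    :", psutil.virtual_memory().total / 1024**3)
print("available RAM (GB):", psutil.virtual_memory().available / 1024**3)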
Find the relevant code snippets below. Maybe someone can spot a mistake or give a hint as to what else I could try?
import xgboost as xgb
from sklearn.metrics import precision_score
import ray
from ray import tune


def train_xgb(config, data):
    clf = xgb.XGBClassifier(**config)
    clf.fit(data[0], data[1])
    predictions = clf.predict(data[2])
    prec = precision_score(data[3], predictions)
    tune.report(precision=prec)
    return {"precision": prec}
config = {
    "objective": "multi:softmax",
    "eval_metric": ["logloss", "error"],
    "n_estimators": tune.choice([5, 10, 15]),
    "max_depth": 5,
    "learning_rate": 0.01,
    "num_class": 2,
    "booster": "gbtree",
    "gamma": 0,
    "base_score": 0.5,
    "random_state": 42,
    "verbosity": 1,
    "n_jobs": 1,
}
trainable = tune.with_parameters(train_xgb, config=config, data=(X_train, y_train, X_valid, y_valid))

try:
    ray.init(include_dashboard=False)  # tried with and without this line of code
    analysis = tune.run(
        trainable,
        resources_per_trial={"cpu": 4},
        config=config,
        chdir_to_trial_dir=False,
        num_samples=3,
    )
except Exception as e:
    print("An error occurred:", e)
Error Message:
2023-08-29 08:03:34,335 INFO worker.py:1544 -- Started a local Ray instance. View the dashboard at 127.0.0.1:8265
2023-08-29 08:03:35,193 ERROR trial_runner.py:1062 -- Trial train_xgb_8d72c_00000: Error processing event.
ray.tune.error._TuneNoNextExecutorEventError: Traceback (most recent call last):
  File "/home/.../miniconda3/envs/xgb-impl/lib/python3.10/site-packages/ray/tune/execution/ray_trial_executor.py", line 1276, in get_next_executor_event
    future_result = ray.get(ready_future)
  File "/home/.../miniconda3/envs/xgb-impl/lib/python3.10/site-packages/ray/_private/client_mode_hook.py", line 105, in wrapper
    return func(*args, **kwargs)
  File "/home/.../miniconda3/envs/xgb-impl/lib/python3.10/site-packages/ray/_private/worker.py", line 2382, in get
    raise value
ray.exceptions.RayActorError: The actor died unexpectedly before finishing this task.
    class_name: wrap_function.<locals>.ImplicitFunc
    actor_id: 53ebd3abb1a0bfb346c3d56c01000000
    namespace: 89f3dadd-4b17-422e-b640-19e35b2b81f2
The actor is dead because its owner has died. Owner Id: 01000000ffffffffffffffffffffffffffffffffffffffffffffffff Owner Ip address: 134.95.55.205 Owner worker exit type: SYSTEM_ERROR Worker exit detail: Owner's node has crashed.
The actor never ran - it was cancelled before it started running.
(raylet) [2023-08-29 08:03:35,179 E 3121452 3121496] (raylet) agent_manager.cc:135: The raylet exited immediately because the Ray agent failed. The raylet fate shares with the agent. This can happen because the Ray agent was unexpectedly killed or failed. Agent can fail when
(raylet) - The version of `grpcio` doesn't follow Ray's requirement. Agent can segfault with the incorrect `grpcio` version. Check the grpcio version `pip freeze | grep grpcio`.
(raylet) - The agent failed to start because of unexpected error or port conflict. Read the log `cat /tmp/ray/session_latest/dashboard_agent.log`. You can find the log file structure here https://docs.ray.io/en/master/ray-observability/ray-logging.html#logging-directory-structure.
(raylet) - The agent is killed by the OS (e.g., out of memory).
[2023-08-29 08:03:35,332 E 3115918 3121532] core_worker.cc:569: :info_message: Attempting to recover 2 lost objects by resubmitting their tasks. To disable object reconstruction, set @ray.remote(max_retries=0).
2023-08-29 08:03:51,198 WARNING tune.py:146 -- Stop signal received (e.g. via SIGINT/Ctrl+C), ending Ray Tune run. This will try to checkpoint the experiment state one last time. Press CTRL+C (or send SIGINT/SIGKILL/SIGTERM) to skip.
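Following the hints in the raylet output, the installed ray and grpcio versions can be compared directly from Python; this is just a small sketch equivalent to the suggested pip freeze | grep grpcio:

from importlib.metadata import version

import ray

# Versions to check against Ray's grpcio requirement mentioned in the raylet error.
print("ray   :", ray.__version__)
print("grpcio:", version("grpcio"))

The dashboard_agent.log mentioned in the same message (/tmp/ray/session_latest/dashboard_agent.log) is probably the next place to look for why the agent died.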