I have a script for finetuning a transformer, based on this tutorial, which I run on a remote SLURM-based server. When I execute it interactively from the command line, it runs and produces the desired output. However, when I submit it as a batch job, it fails while distributing the computational resources. The error is raised when TrainingArguments is instantiated; I'm not posting that call because its arguments are just hyperparameters for the model I'm training. I have investigated the issue and found that the interactive and batched executions diverge at line 198 of accelerate/state.py:
elif get_int_from_env(["PMI_SIZE", "OMPI_COMM_WORLD_SIZE", "MV2_COMM_WORLD_SIZE", "WORLD_SIZE"], 1) > 1:
From the command line this condition evaluates to False, because all of those variables resolve to 1, while in the batch job PMI_SIZE is 128, which exceeds the threshold of 1. The code then tries to create a TCPStore object and fails when it attempts to connect (the RuntimeError: Connection reset by peer at the bottom of the traceback).
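To make that check easy to reproduce, here is a minimal standalone version of the condition (get_int_from_env is re-implemented here from what I can see in accelerate's source, so treat it as an approximation); running it interactively and then inside the batch job shows exactly where the two executions diverge:

import os

NAMES = ["PMI_SIZE", "OMPI_COMM_WORLD_SIZE", "MV2_COMM_WORLD_SIZE", "WORLD_SIZE"]

def get_int_from_env(env_keys, default):
    # The first of these variables that is set to a non-negative integer wins,
    # otherwise the default (1) is returned.
    for key in env_keys:
        value = int(os.environ.get(key, -1))
        if value >= 0:
            return value
    return default

print({name: os.environ.get(name) for name in NAMES})
print("multi-process branch taken:", get_int_from_env(NAMES, 1) > 1)

Interactively the last line prints False; in the batch job it prints True, because PMI_SIZE is 128.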
I have tried setting the MASTER_ADDR and MASTER_PORT variables to the host and port I use to log in to the server, but that only led to a connection timeout. I have also tried running TrainingArguments with sagemaker and deepspeed, but got the same result. My full traceback is at the bottom of this post.
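Since TCPStore ultimately just opens a TCP connection from each process to MASTER_ADDR:MASTER_PORT, one sanity check I could run from inside the batch job is a plain socket test (sketch below, assuming the two variables are already exported in the job environment); if this fails from the compute node, the rendezvous cannot work either:

import os
import socket

addr = os.environ.get("MASTER_ADDR", "127.0.0.1")
port = int(os.environ.get("MASTER_PORT", "29500"))

# Try the same kind of connection the TCPStore client would make.
try:
    with socket.create_connection((addr, port), timeout=10):
        print(f"reached {addr}:{port}")
except OSError as exc:
    print(f"could not reach {addr}:{port}: {exc}")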
So the question is: should I try to work out the correct arguments to give TCPStore so that it can connect to the server successfully? Or is there a way to make it distribute the resources without attempting that connection at all?
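To illustrate what I mean by the second option, this is the kind of workaround I have in mind but have not tried (and I don't know whether SLURM or MPI needs these variables elsewhere in the job): drop the world-size variables before TrainingArguments is constructed, so that accelerate's check stays at its default of 1 and no TCPStore is created.

import os

# Untested idea: remove the MPI/PMI world-size variables injected by the batch
# environment so the check in accelerate/state.py evaluates to <= 1.
for var in ("PMI_SIZE", "OMPI_COMM_WORLD_SIZE", "MV2_COMM_WORLD_SIZE"):
    os.environ.pop(var, None)

# ...and only after this, instantiate TrainingArguments as in the traceback below.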
Thanks!
╭───────────────────── Traceback (most recent call last) ──────────────────────╮
│ /users/home/korat/acode/finetune_classification.py:170 in <module> │
│ │
│ 167 │ │ df.to_csv(os.path.join(output_dir_base,f'results-{label_col}{a │
│ 168 │
│ 169 if __name__ == "__main__": │
│ ❱ 170 │ main() │
│ │
│ /users/home/korat/acode/finetune_classification.py:155 in main │
│ │
│ 152 │ │ i+=1 │
│ 153 │ │ if i>3: │
│ 154 │ │ │ break │
│ ❱ 155 │ │ loss = train_model(model_checkpoint, dataset, weight_decay=wei │
│ 156 │ │ │ │ │ │ │ │ │ adam_beta1=adam_beta1, │
│ 157 │ │ │ │ │ │ │ │ │ adam_beta2=adam_beta2, │
│ 158 │ │ │ │ │ │ │ │ │ adam_epsilon=adam_epsilon, │
│ │
│ /users/home/korat/acode/finetune_classification.py:86 in train_model │
│ │
│ 83 │ output_dir = output_dir_base + "/" + hyparam_comb_to_str(params) │
│ 84 │ │
│ 85 │ #get_int_from_env(["PMI_SIZE"], 1) = 128, and this is where the co │
│ ❱ 86 │ training_args = TrainingArguments( │
│ 87 │ │ │
│ 88 │ │ output_dir=output_dir, │
│ 89 │ │ evaluation_strategy=IntervalStrategy.STEPS, │
│ <string>:111 in __init__ │
│ │
│ /users/home/korat/.local/lib/python3.10/site-packages/transformers/training_ │
│ args.py:1340 in __post_init__ │
│ │
│ 1337 │ │ if ( │
│ 1338 │ │ │ self.framework == "pt" │
│ 1339 │ │ │ and is_torch_available() │
│ ❱ 1340 │ │ │ and (self.device.type != "cuda") │
│ 1341 │ │ │ and (get_xla_device_type(self.device) != "GPU") │
│ 1342 │ │ │ and (self.fp16 or self.fp16_full_eval) │
│ 1343 │ │ ): │
│ │
│ /users/home/korat/.local/lib/python3.10/site-packages/transformers/training_ │
│ args.py:1764 in device │
│ │
│ 1761 │ │ The device used by this process. │
│ 1762 │ │ """ │
│ 1763 │ │ requires_backends(self, ["torch"]) │
│ ❱ 1764 │ │ return self._setup_devices │
│ 1765 │ │
│ 1766 │ @property │
│ 1767 │ def n_gpu(self): │
│ │
│ /users/home/korat/.local/lib/python3.10/site-packages/transformers/utils/gen │
│ eric.py:54 in __get__ │
│ │
│ 51 │ │ attr = "__cached_" + self.fget.__name__ │
│ 52 │ │ cached = getattr(obj, attr, None) │
│ 53 │ │ if cached is None: │
│ ❱ 54 │ │ │ cached = self.fget(obj) │
│ 55 │ │ │ setattr(obj, attr, cached) │
│ 56 │ │ return cached │
│ 57 │
│ │
│ /users/home/korat/.local/lib/python3.10/site-packages/transformers/training_ │
│ args.py:1695 in _setup_devices │
│ │
│ 1692 │ │ │ del os.environ["ACCELERATE_USE_DEEPSPEED"] │
│ 1693 │ │ │ self._n_gpu = 1 │
│ 1694 │ │ else: │
│ ❱ 1695 │ │ │ self.distributed_state = PartialState(backend=self.ddp_ba │
│ 1696 │ │ │ self._n_gpu = 1 │
│ 1697 │ │ if not is_sagemaker_mp_enabled(): │
│ 1698 │ │ │ device = self.distributed_state.device │
│ │
│ /users/home/korat/.local/lib/python3.10/site-packages/accelerate/state.py:23 │
│ 8 in __init__ │
│ │
│ 235 │ │ │ │ │ # Backend is not set by the user, we set it here │
│ 236 │ │ │ │ │ kwargs.pop("backend", None) │
│ 237 │ │ │ │ │ self.backend = backend │
│ ❱ 238 │ │ │ │ │ torch.distributed.init_process_group(self.backend, │
│ 239 │ │ │ │ self.num_processes = torch.distributed.get_world_size( │
│ 240 │ │ │ │ self.process_index = torch.distributed.get_rank() │
│ 241 │ │ │ │ self.local_process_index = local_rank │
│ │
│ /users/home/korat/.local/lib/python3.10/site-packages/torch/distributed/dist │
│ ributed_c10d.py:900 in init_process_group │
│ │
│ 897 │ │ │ rendezvous_iterator = rendezvous( │
│ 898 │ │ │ │ init_method, rank, world_size, timeout=timeout │
│ 899 │ │ │ ) │
│ ❱ 900 │ │ │ store, rank, world_size = next(rendezvous_iterator) │
│ 901 │ │ │ store.set_timeout(timeout) │
│ 902 │ │ │ │
│ 903 │ │ │ # Use a PrefixStore to avoid accidental overrides of keys │
│ │
│ /users/home/korat/.local/lib/python3.10/site-packages/torch/distributed/rend │
│ ezvous.py:245 in _env_rendezvous_handler │
│ │
│ 242 │ master_addr = _get_env_or_raise("MASTER_ADDR") │
│ 243 │ master_port = int(_get_env_or_raise("MASTER_PORT")) │
│ 244 │ │
│ ❱ 245 │ store = _create_c10d_store(master_addr, master_port, rank, world_s │
│ 246 │ │
│ 247 │ yield (store, rank, world_size) │
│ 248 │
│ │
│ /users/home/korat/.local/lib/python3.10/site-packages/torch/distributed/rend │
│ ezvous.py:176 in _create_c10d_store │
│ │
│ 173 │ │ return PrefixStore(f"/worker/attempt_{attempt}", tcp_store) │
│ 174 │ else: │
│ 175 │ │ start_daemon = rank == 0 │
│ ❱ 176 │ │ return TCPStore( │
│ 177 │ │ │ hostname, port, world_size, start_daemon, timeout, multi_t │
│ 178 │ │ ) │
│ 179 │
╰──────────────────────────────────────────────────────────────────────────────╯
RuntimeError: Connection reset by peer