
I am using a cluster (similar to SLURM, but it uses HTCondor) and I wanted to run my code with VS Code (especially its debugger) and its remote sync extension.

I tried running it with the debugger in VS Code, but it didn't quite work as expected.

First I log in to the cluster using VS Code and remote sync as usual, and that works just fine. Then I go ahead and request an interactive job with the command:

condor_submit -i request_cpus=4 request_gpus=1

which successfully gives me a node/GPU to use.

Once I have that, I try to run the debugger, but somehow it logs me out of the remote session (and judging from the print statements it ends up running on the head node). That's NOT what I want. I want to run my job in the interactive session, on the node/GPU I was allocated. Why is VS Code running it in the wrong place, and how can I make it run in the right place?


Some of the output from the integrated terminal:

source /home/miranda9/miniconda3/envs/automl-meta-learning/bin/activate
/home/miranda9/miniconda3/envs/automl-meta-learning/bin/python /home/miranda9/.vscode-server/extensions/ms-python.python-2020.2.60897-dev/pythonFiles/lib/python/new_ptvsd/wheels/ptvsd/launcher /home/miranda9/automl-meta-learning/automl/automl/meta_optimizers/differentiable_SGD.py 
conda activate base
(automl-meta-learning) miranda9~/automl-meta-learning $ source /home/miranda9/miniconda3/envs/automl-meta-learning/bin/activate
(automl-meta-learning) miranda9~/automl-meta-learning $ /home/miranda9/miniconda3/envs/automl-meta-learning/bin/python /home/miranda9/.vscode-server/extensions/ms-python.python-2020.2.60897-dev/pythonFiles/lib/python/new_ptvsd/wheels/ptvsd/launcher /home/miranda9/automl-meta-learning/automl/automl/meta_optimizers/differentiable_SGD.py 
--> main in differentiable SGD
hello world torch_utils!
vision-sched.cs.illinois.edu
Files already downloaded and verified
Files already downloaded and verified
Files already downloaded and verified
-> initialization of DiMO done!

---> i = 0, iteration/it 1 about to start
lp_norms(mdl) = 18.43514633178711
lp_norms(meta_optimized mdl) = 18.43514633178711
[e=0,it=1], train_loss: 2.304989814758301, train error: -1, test loss: -1, test error: -1

---> i = 1, iteration/it 2 about to start
lp_norms(mdl) = 18.470401763916016
lp_norms(meta_optimized mdl) = 18.470401763916016
[e=0,it=2], train_loss: 2.3068909645080566, train error: -1, test loss: -1, test error: -1

---> i = 2, iteration/it 3 about to start
lp_norms(mdl) = 18.548133850097656
lp_norms(meta_optimized mdl) = 18.548133850097656
[e=0,it=3], train_loss: 2.3019633293151855, train error: -1, test loss: -1, test error: -1

---> i = 0, iteration/it 1 about to start
lp_norms(mdl) = 18.65604019165039
lp_norms(meta_optimized mdl) = 18.65604019165039
[e=1,it=1], train_loss: 2.308889150619507, train error: -1, test loss: -1, test error: -1

---> i = 1, iteration/it 2 about to start
lp_norms(mdl) = 18.441967010498047
lp_norms(meta_optimized mdl) = 18.441967010498047
[e=1,it=2], train_loss: 2.300947666168213, train error: -1, test loss: -1, test error: -1

---> i = 2, iteration/it 3 about to start
lp_norms(mdl) = 18.545459747314453
lp_norms(meta_optimized mdl) = 18.545459747314453
[e=1,it=3], train_loss: 2.30662202835083, train error: -1, test loss: -1, test error: -1
-> DiMO done training!
--> Done with Main
(automl-meta-learning) miranda9~/automl-meta-learning $ conda activate base
(automl-meta-learning) miranda9~/automl-meta-learning $ hostname
vision-sched.cs.illinois.edu

Doesn't even run without debugging mode

The problem is more serious than I thought. Not only can I not run the debugger in the interactive session, I can't even "Run Without Debugging" without it switching to the Python Debug Console on its own. That means I have to run things manually with python main.py, but then I can't use the variables pane... which is a big loss!

What I am doing is switching my terminal to the condor_ssh_to_job session and then clicking the Run Without Debugging button (or ^F5, i.e. Control + fn + F5). Although I made sure to be on the interactive session at the bottom of my integrated terminal, it switches by itself to the Python Debug Console pane, which is not connected to the interactive session I requested from my cluster...
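(For context, attaching a terminal to a running job with HTCondor looks roughly like this; the job id below is only an example:)

    condor_q                   # note the ClusterId.ProcId of the running interactive job
    condor_ssh_to_job 1234.0   # open a shell on the node where that job is running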



Charlie Parker

3 Answers


I stumbled upon a related issue recently (I wanted to use VS Code's interactive Python capabilities on a compute node). The other answers weren't working for me, but this solved it:

  1. ssh to the remote cluster: ssh cluster
  2. inside the remote cluster, add your public key to the authorized keys, i.e. append the content of ~/.ssh/id_rsa.pub (local machine) to ~/.ssh/authorized_keys (remote cluster); see the sketch below
  3. allocate some resources inside the cluster (this particular cluster uses Slurm rather than condor, so in this case I use something like srun --pty bash)
  4. get the name of the compute node, typically visible in the command-line prompt (as username@nodename). For argument's sake, let's imagine I get a generic name like node001
  5. for simplicity, on my local machine modify the ~/.ssh/config file and edit it as:
Host cluster
   # stuff written

Host node*
    HostName %h
    ProxyJump cluster
    User $USERNAME

Now I'm able to ssh to it from my local machine (as long as the compute node is running) with ssh node001.
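For step 2, a minimal sketch of installing the key (assuming the default key path and that the Host cluster entry above already exists):

    # run on the local machine; appends your public key to the cluster's authorized_keys
    ssh-copy-id cluster
    # or, equivalently:
    cat ~/.ssh/id_rsa.pub | ssh cluster 'mkdir -p ~/.ssh && cat >> ~/.ssh/authorized_keys'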

In VS Code this boils down to:

  1. CTRL+P > Remote-SSH: Connect to Host...
  2. type in the name of the node, here node001
  3. you get connected to the node; now every interactive Python session you run (including Jupyter and jupytext) will have access to your allocated resources (see the quick check below)
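A quick sanity check once connected (hostname and nvidia-smi are standard commands; node001 is just the example name from above):

    hostname      # should print node001 rather than the login node
    nvidia-smi    # shows the GPU(s) visible from this session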

I don't know how generic this solution is; I hope it helps at least somebody!

Louis

You can try reversing the order of operations: first submit the job, obtain the name of the compute node allocated to you, then instruct VS Code to connect to that compute node rather than the login node.

So the first step would be

condor_submit -i request_cpus=4 request_gpus=1

and note the name of the compute node. I'll assume it is node001 in the following.
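(One way to see that name, once the interactive shell opens on the allocated node, is simply:)

    hostname    # prints the compute node's name, e.g. node001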

Then, open VSCode on your laptop, click on the Remote Development extension icon and choose "Remote SSH: Connect to Host...". Choose "+ Add new SSH host...". In the "Enter SSH command" box, add the following:

ssh -J vision-sched.cs.illinois.edu miranda9@node001

VS Code will then ask you which SSH configuration file it should update. Make sure to review that configuration: specify the SSH keys if needed, the user name, etc. Also make sure that vision-sched.cs.illinois.edu is itself correctly configured in that file.
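For reference, the resulting entries in ~/.ssh/config would look roughly like this (ProxyJump is the config-file equivalent of the -J flag; the host names and user are the ones from this question):

    Host vision-sched.cs.illinois.edu
        User miranda9

    Host node001
        HostName node001
        User miranda9
        ProxyJump vision-sched.cs.illinois.edu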

Then you can choose that host to connect to. VSCode will then execute on the compute node, and will be disconnected when the allocation finishes.

damienfrancois
  • what is the `-J` option for? Why is it needed? It seems like something extra (commenting to make the answer more self-contained) – Charlie Parker Jul 07 '20 at 20:28
  • I tried your suggestion but it didn't work. I usually connect with a password. I entered the password and it did not work. Is there something else you think I could do? (btw, I'm impressed you came up with this even if it's currently not working. Kudos). – Charlie Parker Jul 07 '20 at 20:33
  • The `-J` option is to use a proxy. As the compute node is not connected to the public internet, you have to go through the login node to access it. – damienfrancois Jul 07 '20 at 20:44
  • That's what I assumed you were doing, but somehow I still failed to connect to it. Is there any way to debug why it's not allowing me to connect? – Charlie Parker Jul 07 '20 at 21:26
  • Did you try your own solution? I'm curious because then at least we'd know that it works for someone (and it removes me as a variable, so it's not just me). Thanks for your time. – Charlie Parker Jul 07 '20 at 23:22
  • Yes, I tested it with VSCode 1.46.1 on macOS 10.14.6 (SSH version OpenSSH_7.9p1) against a Slurm cluster to which I have access with an SSH key – damienfrancois Jul 08 '20 at 06:23
  • Try first to get the `ssh -J ...` part to work. The best way is to configure the connections in your `.ssh/config` file. Create SSH key pairs and install them in your cluster. Also check the documentation of your cluster to see if there is any information about SSHing to a compute node. – damienfrancois Jul 08 '20 at 06:24
  • @damienfrancois, first, thanks for the solution. I have allocated 2 GPUs for my interactive Slurm job, but when I open an ssh session to the compute node, all the GPUs are exposed to VS Code. How can I avoid that? Could you please give more details on how you tested this with Slurm? Thank you – arash Aug 20 '20 at 21:20
  • You can look which GPUs are allocated to your job with `scontrol -d show ` and then setup the `NVIDIA_VISIBLE_DEVICES` env variable accordingly. – damienfrancois Aug 21 '20 at 13:48
  • This fails on compute nodes that only allow read-only access to the home filesystem. – dbrane Jul 05 '21 at 20:23

Here is a simpler workaround:

  1. on the remote server, create a file named bash somewhere, for example /home/myuser/pathto/bash
  2. make it executable using chmod +x bash
  3. write salloc [your desired options for the interactive job] in the bash file
  4. in the VS Code settings, search for Automation Shell: Linux and click on "Edit in settings.json"
  5. change the line to "terminal.integrated.automationShell.linux": "/home/myuser/pathto/bash" and save it (use the absolute path; for example ~/pathto/bash didn't work for me)
  6. Done :)

Now every time you run the debugger, it will first request the interactive job, and the debugger will run inside it. But take into account that this also applies to tasks you run from tasks.json.
You can also use srun instead of salloc, for example srun --pty -t 2:00:00 --mem=8G bash. A sketch of the wrapper script and the corresponding setting is shown below.
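A minimal sketch of what steps 1-5 produce. The answer writes a bare salloc line in the wrapper; the variant below uses the srun form instead and forwards the arguments VS Code passes to its automation shell ("$@"), which is an assumption about how VS Code invokes it. The path and options are only examples:

    #!/bin/bash
    # /home/myuser/pathto/bash  (the wrapper from step 1; path is an example)
    # request an interactive allocation and run VS Code's command inside it
    srun --pty -t 2:00:00 --mem=8G bash "$@"

and in settings.json:

    // settings.json (VS Code accepts comments here)
    {
        "terminal.integrated.automationShell.linux": "/home/myuser/pathto/bash"
    }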

asalimih
  • I'm having trouble trying to debug a C program with this approach: when I launch a debug session in VS Code, it creates some random files on the login node's `/tmp` and then fails when it doesn't find them on the compute node's `/tmp`. Any idea how to fix this? – Seralpa Sep 26 '22 at 02:21
  • @Seralpa Unfortunately I think this approach only works when login node and compute node are the same. – asalimih Oct 01 '22 at 11:38