
I'm creating a GCP instance with a startup script that should take about 30 minutes to install everything and then run a Python machine-learning script.

I'm creating it with something like this:

gcloud compute instances create XXXXX \
  --project YYYYY \
  --machine-type='a2-highgpu-1g' \
  --zone='us-central1-a' \
  --image-project='AAAAA' \
  --image-family='BBBBBB' \
  --boot-disk-size=50GB \
  --accelerator type=CCCCCCC,count=1 \
  --metadata "DDDDDDDD" \
  --maintenance-policy TERMINATE --restart-on-failure \
  --scopes https://www.googleapis.com/auth/cloud-platform \
  --metadata-from-file startup-script=start-script.sh

The last line of start-script.sh runs a Python script. The log looks normal at first, but after a few minutes (probably during a part that produces no verbose output) I get the following:

Dec  3 16:21:01 home CRON[26644]: (root) CMD (/opt/deeplearning/bin/run_diagnostic_tool.sh 2>&1)
Dec  3 16:21:01 home CRON[26643]: (CRON) info (No MTA installed, discarding output)
Dec  3 16:22:01 home CRON[26679]: (root) CMD (/opt/deeplearning/bin/run_diagnostic_tool.sh 2>&1)
Dec  3 16:22:01 home CRON[26678]: (CRON) info (No MTA installed, discarding output)
Dec  3 16:23:01 home CRON[26713]: (root) CMD (/opt/deeplearning/bin/run_diagnostic_tool.sh 2>&1)
Dec  3 16:23:01 home CRON[26712]: (CRON) info (No MTA installed, discarding output)
Dec  3 16:24:01 home CRON[26749]: (root) CMD (/opt/deeplearning/bin/run_diagnostic_tool.sh 2>&1)
Dec  3 16:24:02 home CRON[26748]: (CRON) info (No MTA installed, discarding output)
Dec  3 16:24:55 home google_metadata_script_runner[778]: error while communicating with "startup-script" script: bufio.Scanner: token too long
Dec  3 16:24:58 home google_metadata_script_runner[778]: startup-script exit status 0
Dec  3 16:24:58 home google_metadata_script_runner[778]: Finished running startup scripts.

So it seems clear to me that google_metadata_script_runner is timing out. I don't know whether that's because the script takes too long or because the Python process produces no output.
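For context on the "token too long" message in the log: Go's bufio.Scanner rejects any single token (by default, a line) longer than its buffer limit, which defaults to 64 KiB (bufio.MaxScanTokenSize). Whether the script runner relies on that default is an assumption here, not something the log confirms. A progress bar that redraws itself with carriage returns and never prints a newline would produce such a line. The sketch below only simulates that output shape; the bar width and step count are made-up values, not measurements from the training script.

```python
# Default per-token limit of Go's bufio.Scanner (bufio.MaxScanTokenSize).
MAX_SCAN_TOKEN_SIZE = 64 * 1024


def simulate_progress_output(steps: int, width: int = 100) -> bytes:
    """Return bytes shaped like a progress bar redrawn with carriage returns."""
    chunks = []
    for step in range(1, steps + 1):
        filled = width * step // steps
        bar = "#" * filled + "-" * (width - filled)
        # Each redraw starts with a carriage return and never ends with a newline.
        chunks.append(f"\r{step:>5}/{steps} [{bar}]".encode())
    return b"".join(chunks)


def longest_line(data: bytes) -> int:
    """Length in bytes of the longest newline-delimited line in data."""
    return max(len(line) for line in data.split(b"\n"))


output = simulate_progress_output(steps=2000)
print(longest_line(output), longest_line(output) > MAX_SCAN_TOKEN_SIZE)
```

If the real output looks like this, a line-based reader sees the whole run as one ever-growing line, which would match an error that appears several minutes in rather than at startup.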

I append this to every command: | tee -a /root/outlog.txt, so I know the problem is on the second-to-last line of the script:

. /root/work/venv_diffusers_sd_2/bin/accelerate launch /root/work/diffusers_sd_v2/examples/dreambooth/train_dreambooth.py \
 --gradient_accumulation_steps=1 --pretrained_model_name_or_path="stabilityai/stable-diffusion-2-base" \
 --pretrained_vae_name_or_path "stabilityai/sd-vae-ft-mse" --output_dir=/root/work/train_1/model_out/ --with_prior_preservation \
 --prior_loss_weight=1.0 --resolution=512 --train_batch_size=1 --learning_rate=2e-6 \
 --lr_scheduler="constant" --lr_warmup_steps=0 --num_class_images=200 --max_train_steps=2000 \
 --concepts_list="/root/work/train_1/concepts_list.json" --train_text_encoder --revision="fp16" --mixed_precision="fp16"

It works without issues if I copy and paste it into a shell after the startup script fails.
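One workaround that might be worth trying, assuming carriage-return progress output is what overflows the runner's line buffer: run the long command through a small wrapper that writes its output to a file and turns each carriage return into a newline, so nothing reaches the script runner as one giant line. This is a minimal sketch, not a confirmed fix; the helper name run_logged is hypothetical.

```python
import subprocess
import sys


def run_logged(cmd: list[str], log_path: str) -> int:
    """Run cmd, appending its combined stdout/stderr to log_path.

    Carriage returns are rewritten as newlines so that progress-bar
    redraws become separate short lines in the log.
    """
    with open(log_path, "ab") as log:
        proc = subprocess.Popen(
            cmd, stdout=subprocess.PIPE, stderr=subprocess.STDOUT
        )
        assert proc.stdout is not None
        # Read fixed-size chunks so reading never depends on a newline arriving.
        for chunk in iter(lambda: proc.stdout.read(4096), b""):
            log.write(chunk.replace(b"\r", b"\n"))
            log.flush()
        return proc.wait()
```

In the startup script, the training command would be passed as the cmd list and /root/outlog.txt as log_path; because the wrapper prints nothing itself, the script runner only sees the exit status.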

The script uses fileformat=unix.

I've been reading "Using startup scripts on Linux VMs", but I can't find a solution. I've tried multiple times and get the same output after about 10 minutes each time.

  • Add the contents of `startup-script=start-script.sh`. The error **bufio.Scanner: token too long** probably means that a text line is too long or is corrupted. Check the **file format** of the script file (unix versus dos line termination). VIM has the command `:ff=unix` to set the correct file format. – John Hanley Dec 03 '22 at 18:38
  • I know there is a 256 KB size limitation. My script is 7kb long with about 200 lines. Trying the vim command I got: "E492: Not an editor command: ff=unix" on Mac. – Claudio Canales Dec 03 '22 at 18:43
  • That limitation is not related to correct line termination on script files. What is the line content that generates the error? – John Hanley Dec 03 '22 at 18:46
  • It's a 665 characters python command related to PyTorch. I can't paste it here fully. – Claudio Canales Dec 03 '22 at 18:55
  • Put details in the question, not as comments. A 665 character command might be a problem but without details, we can only guess. – John Hanley Dec 03 '22 at 19:14
  • I just pasted it. – Claudio Canales Dec 03 '22 at 19:29
  • What are the answers to my first comment? Update your question with those details. – John Hanley Dec 03 '22 at 19:37
