Questions tagged [google-cloud-tpu]

Google Cloud TPUs (Tensor Processing Units) accelerate machine learning workloads developed using TensorFlow. Use this tag for questions about the Google Cloud TPU service. Topics include the service user experience, issues with training programs written in TensorFlow, project quota issues, security, and authentication.


188 questions
1 vote, 3 answers

TPU custom chip available with Google Cloud ML

Which type of hardware does Google Cloud ML use when running TensorFlow? Is it CPU only, or are Tensor Processing Units (custom cards) also available? Cf. this article.
Nick
0 votes, 1 answer

Cloud TPU - Migrating off of older TF versions

If I am a Cloud TPU user and I have several TPU nodes with older TF versions (<=2.6.x) which are soon to be deprecated, is it possible to get support from the Cloud TPU team with migration? Please assign this issue the highest priority possible as…
0 votes, 2 answers

Error while trying to use GCP VM Instance with TPU VM

I created a VM instance in GCP with a PyTorch XLA environment, and I created a TPU VM with tpu-vm-pt-2.0. I SSHed into the VM instance and activated the conda environment with pytorch-xla. But when I try to run a sample script to test for the TPU,…
0 votes, 1 answer

Access to the v4 TPUs

For our purposes, we would really like to have access to the v4 TPUs. We found the Google form and filled it out a few weeks ago, but it seems we've thrown a dart into an abyss, with no response. Is there any way to accelerate/another method to get…
0 votes, 1 answer

Trouble connecting to GCP TPU VM

I followed the instructions in "Run TensorFlow on TPU pod slices" to a T to create a Cloud TPU VM and run a custom neural network. It's important to note that I have been able to initialize the cloud TPUs when running this…
0 votes, 1 answer

Running PyTorch on Cloud TPU VM on GCP gives INVALID_ARGUMENT: No matching devices found for '/job:localservice/replica:0/task:0/device:TPU_SYSTEM:0'

I created a TPU VM on GCP. I am following the documentation page on how to run a calculation on a Cloud TPU VM by using PyTorch. I have set the XRT TPU device configuration in the VM with export XRT_TPU_CONFIG="localservice;0;localhost:51011" I…
BioGeek
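
As a reading aid for the question above, here is a minimal sketch of how the XRT_TPU_CONFIG value breaks down and how its fields line up with the device path in the error message. It assumes each entry has the form "job;task;host:port" and that multiple entries, if any, are separated by "|". The `XrtWorker` type and `parse_xrt_tpu_config` helper are hypothetical names for illustration, not part of torch_xla.

```python
from typing import List, NamedTuple


class XrtWorker(NamedTuple):
    job: str      # worker/job name, e.g. "localservice"
    task: int     # task index within that job
    address: str  # host:port where the XRT service is expected


def parse_xrt_tpu_config(value: str) -> List[XrtWorker]:
    """Split an XRT_TPU_CONFIG-style string into its fields.

    Assumption: entries look like "job;task;host:port" and are
    separated by "|" when there is more than one.
    """
    workers = []
    for entry in value.split("|"):
        parts = entry.strip().split(";")
        if len(parts) != 3:
            raise ValueError(f"expected 'job;task;host:port', got {entry!r}")
        job, task, address = parts
        workers.append(XrtWorker(job, int(task), address))
    return workers


# The value quoted in the question.
for w in parse_xrt_tpu_config("localservice;0;localhost:51011"):
    # Mirrors the job/task part of the device path in the error message.
    print(f"/job:{w.job}/replica:0/task:{w.task} -> {w.address}")
```

The sketch only checks the shape of the string. It does not contact the XRT service or diagnose why no matching device was found.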
0 votes, 1 answer

TPU not found on Google VM (jax version 0.2.16)

I'm running a TPU v3-8 VM on Google. On the VM, I installed jax with pip install "jax[tpu]==0.2.16" -f https://storage.googleapis.com/jax-releases/libtpu_releases.html. Unfortunately, I'm getting the message No GPU/TPU found, falling back to CPU,…
BlackHawk
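
For the situation described in the question above, a quick check of which backend JAX selected might look like the following sketch. It uses `jax.devices()` and the `platform` attribute of each device. The `jax_platforms` helper is a hypothetical name, and the import is guarded so the snippet also runs where JAX is not installed.

```python
def jax_platforms():
    """Return the platform names JAX reports, or None if JAX is missing."""
    try:
        import jax
    except ImportError:
        return None
    # Each device exposes a platform string such as "cpu", "gpu" or "tpu".
    return sorted({d.platform for d in jax.devices()})


platforms = jax_platforms()
if platforms is None:
    print("jax is not installed in this environment")
elif "tpu" in platforms:
    print("TPU backend detected")
else:
    print(f"no TPU backend; JAX is using: {platforms}")
```

This only reports what JAX sees. It does not explain why the TPU is missing.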
0 votes, 1 answer

How to understand the padding rules on cloud TPU?

Cloud TPU has two padding rules on batch_size and feature_size of convolution operations, to minimize memory overhead and maximize computational efficiency (from here). The total batch size should be a multiple of 64 (8 per TPU core), and feature…
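
The batch-size half of the rule quoted above can be worked through with a small sketch. It assumes 8 TPU cores and a per-core multiple of 8, as the excerpt states, and reads the rule as "round the total batch up to the next multiple of 64". The helper names are hypothetical, and the feature-size rule is left out because the excerpt is truncated.

```python
def round_up(n: int, multiple: int) -> int:
    """Smallest multiple of `multiple` that is >= n."""
    if n <= 0 or multiple <= 0:
        raise ValueError("n and multiple must be positive")
    return -(-n // multiple) * multiple


def padded_batch(total_batch: int, num_cores: int = 8, per_core_multiple: int = 8):
    """Return (padded batch size, number of padding examples).

    Assumption: the total batch is padded up to a multiple of
    num_cores * per_core_multiple (64 with the defaults).
    """
    padded = round_up(total_batch, num_cores * per_core_multiple)
    return padded, padded - total_batch


for batch in (64, 100, 128, 200):
    padded, waste = padded_batch(batch)
    print(f"batch {batch:>3} -> padded {padded:>3} ({waste} padding examples)")
```

Under this reading, a batch of 100 would occupy the same space as a batch of 128, which is why batch sizes that are already multiples of 64 avoid the overhead.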
0 votes, 1 answer

Connection refused when switching TPU version

How can I switch the TPU version for the TPU VM architecture? When attempting to switch the software version for a TPU (TPU VM architecture, switching from tpu-vm-tf-2.6.0-pod to tpu-vm-base) using instructions found here, I get Connection Refused exception with…
Nevus
0 votes, 1 answer

How to fix "INVALID_ARGUMENT: Cloud TPU received an invalid argument. The "GuestAttributes" value "" was not found."?

I recently started using TPUv3-8 VMs to train language models and haven't had any issues with VMs crashing or the like. However, one of my TPU VMs seems to now have broken out of nowhere and I am completely lost. When trying to ssh to the VM, I get…
0 votes, 0 answers

TPU VM access Cloud Storage 403 forbidden when writing files

When I run my Python command to train my model on my tpu-vm, it fails when writing files to Cloud Storage. Traceback (most recent call last): File "device_train.py", line 302, in save(network, step, bucket, model_dir, File…
0 votes, 0 answers

TPU VM SSH connection is unstable or disconnects after a few seconds

When I use the command proxychains gcloud alpha compute tpus tpu-vm ssh xx --zone zone to connect to the TPU VM, the connection only lasts 5 to 10 seconds. This is a serious problem because I don't have time to execute my command. I have checked the…
csliu_jia
0 votes, 1 answer

Write to a GCP bucket from a TPU VM

I am training a BERT model using a TPU VM on GCP. I want to use my bucket as the Datasets library cache file path. I have followed instructions from https://cloud.google.com/tpu/docs/tutorials/bert-2.x and set my bucket link in the HF_DATASETS_CACHE…
0 votes, 1 answer

TPU training fails with certain metric, succeeds on CPU

I'm trying to train a simple EfficientNet-style model on some images. Training works fine on a CPU, but when I switch to a TPU I get the following error: (0) Invalid argument: {{function_node __inference_train_function_38255}} Output…
dgmp88
0 votes, 2 answers

How can I use a Cloud TPU with TensorFlow Lite Model Maker?

I'm training an object detection model (EfficientDet-Lite) using TensorFlow Lite Model Maker in Colab and I'd like to use a Cloud TPU. I have all the images in a GCS bucket and provide a CSV file. When I call object_detector.create I get the…