
I created a VM instance in GCP with the PyTorch XLA environment, and I created a TPU VM with the tpu-vm-pt-2.0 runtime.

I SSHed into the VM instance and activated the conda environment with pytorch-xla. But when I run a sample script to test the TPU, it returns the following error:

```
2023-04-17 19:35:38.550666: F    5184 tensorflow/compiler/xla/xla_client/xrt_computation_client.cc:1362] Non-OK-status: session.Run({tensorflow::Output(result, 0)}, &outputs) status: UNIMPLEMENTED: method "RunStep" not implemented
*** Begin stack trace ***
	tsl::CurrentStackTrace()
	xla::XrtComputationClient::InitializeAndFetchTopology(std::string const&, int, std::string const&, tensorflow::ConfigProto const&)
	xla::XrtComputationClient::InitializeDevices(std::unique_ptr<tensorflow::tpu::TopologyProto, std::default_delete<tensorflow::tpu::TopologyProto> >)
	xla::XrtComputationClient::XrtComputationClient(xla::XrtComputationClient::Options, std::unique_ptr<tensorflow::tpu::TopologyProto, std::default_delete<tensorflow::tpu::TopologyProto> >)
	xla::ComputationClient::Create()
	xla::ComputationClient::Get()
	PyCFunction_Call
	_PyObject_MakeTpCall
	_PyEval_EvalFrameDefault
	_PyFunction_Vectorcall
	_PyObject_GenericGetAttrWithDict
	_PyEval_EvalFrameDefault
	_PyEval_EvalCodeWithName
	_PyFunction_Vectorcall
	_PyEval_EvalCodeWithName
	PyEval_EvalCode
	PyRun_SimpleFileExFlags
	Py_BytesMain
	__libc_start_main
*** End stack trace ***
Aborted
```

Can someone help me debug?

I tried the quickstart guides and the PyTorch tutorials from the documentation, but I don't know what I am doing wrong. For instance, I also tried creating my VM instance and my TPU instance in the same zone, but I still get the error. I tried running the code as `XRT_TPU_CONFIG="tpu_worker;0;{IP_ADDRESS}:8470" python test.py` too, but still the same error.

mr oogway

2 Answers


Since you are using tpu-vm-pt-2.0, it is recommended that you use the PJRT runtime. What ACCELERATOR_TYPE are you using? Depending on that, you might need to follow different guides. Can you please test whether the ResNet example works following the steps here? https://github.com/pytorch/xla/blob/master/docs/pjrt.md#tpu
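As a quick smoke test before the full ResNet run, a minimal sketch (assuming torch_xla 2.0 is installed on the TPU VM) is to select the PJRT runtime instead of your `XRT_TPU_CONFIG` setting and list the XLA devices:

```shell
# On the TPU VM: pick the PJRT runtime instead of XRT.
# Make sure XRT_TPU_CONFIG is NOT set at the same time.
unset XRT_TPU_CONFIG
export PJRT_DEVICE=TPU

# Ask PyTorch/XLA which devices it can see; on a working v2-8/v3-8
# this prints a non-empty list of xla devices.
python3 -c "import torch_xla.core.xla_model as xm; print(xm.get_xla_supported_devices())"
```

If this prints an empty list or crashes, the problem is in the runtime setup rather than in your training script.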

Susie Sargsyan

If you are using the TPU Node architecture, you can try switching to a TPU VM. The TPU VM architecture is recommended over TPU Node: it is easier to use and faster, and you SSH directly into the machine attached to the TPU instead of going through a separate user VM. Here is the guide to create and run a TPU VM: https://cloud.google.com/tpu/docs/run-calculation-pytorch#tpu-vm
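The guide above boils down to a couple of gcloud commands; a sketch is below (the name `my-tpu-vm`, the zone, and the accelerator type are placeholder assumptions — adjust them to your project and quota):

```shell
# Create a TPU VM (not a TPU Node) with the PyTorch 2.0 runtime
gcloud compute tpus tpu-vm create my-tpu-vm \
  --zone=us-central1-b \
  --accelerator-type=v3-8 \
  --version=tpu-vm-pt-2.0

# SSH directly into the TPU VM -- no separate user VM is needed
gcloud compute tpus tpu-vm ssh my-tpu-vm --zone=us-central1-b
```

With this setup your code runs on the TPU host itself, so the "RunStep not implemented" gRPC path between a user VM and a TPU Node never comes into play.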