
I created a VM instance in GCP with the PyTorch XLA environment, and I created a TPU VM with the tpu-vm-pt-2.0 runtime.

I SSHed into the VM instance and activated the conda environment with pytorch-xla. But when I run a sample script to test the TPU, it returns the following error:

```
2023-04-17 19:35:38.550666: F    5184 tensorflow/compiler/xla/xla_client/xrt_computation_client.cc:1362] Non-OK-status: session.Run({tensorflow::Output(result, 0)}, &outputs) status: UNIMPLEMENTED: method "RunStep" not implemented
*** Begin stack trace ***
	tsl::CurrentStackTrace()
	xla::XrtComputationClient::InitializeAndFetchTopology(std::string const&, int, std::string const&, tensorflow::ConfigProto const&)
	xla::XrtComputationClient::InitializeDevices(std::unique_ptr<tensorflow::tpu::TopologyProto, std::default_delete<tensorflow::tpu::TopologyProto> >)
	xla::XrtComputationClient::XrtComputationClient(xla::XrtComputationClient::Options, std::unique_ptr<tensorflow::tpu::TopologyProto, std::default_delete<tensorflow::tpu::TopologyProto> >)
	xla::ComputationClient::Create()
	xla::ComputationClient::Get()
	PyCFunction_Call
	_PyObject_MakeTpCall
	_PyEval_EvalFrameDefault
	_PyFunction_Vectorcall
	_PyObject_GenericGetAttrWithDict
	_PyEval_EvalFrameDefault
	_PyEval_EvalCodeWithName
	_PyFunction_Vectorcall
	_PyEval_EvalCodeWithName
	PyEval_EvalCode
	PyRun_SimpleFileExFlags
	Py_BytesMain
	__libc_start_main
*** End stack trace ***
Aborted
```

Can someone help me debug?

I tried the quickstart guides and the PyTorch tutorials from the documentation, but I don't know what I am doing wrong. For instance, I also tried creating my VM instance and my TPU instance in the same zone, but I still get the error. I tried running the code as `XRT_TPU_CONFIG="tpu_worker;0;{IP_ADDRESS}:8470" python test.py` too, but still the same error.

mr oogway

2 Answers


Since you are using tpu-vm-pt-2.0, it is recommended that you use the PJRT runtime. What ACCELERATOR_TYPE are you using? Depending on that, you might need to follow different guides. Can you please test whether the ResNet example works following the steps here? https://github.com/pytorch/xla/blob/master/docs/pjrt.md#tpu
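As a quick smoke test before the full ResNet run, a minimal sketch (assuming torch_xla 2.0 is installed on the TPU VM) is to select the PJRT runtime instead of your `XRT_TPU_CONFIG` setting and list the XLA devices:

```shell
# On the TPU VM: pick the PJRT runtime instead of XRT.
# Make sure XRT_TPU_CONFIG is NOT set at the same time.
unset XRT_TPU_CONFIG
export PJRT_DEVICE=TPU

# Ask PyTorch/XLA which devices it can see; on a working v2-8/v3-8
# this prints a non-empty list of xla devices.
python3 -c "import torch_xla.core.xla_model as xm; print(xm.get_xla_supported_devices())"
```

If this prints an empty list or crashes, the problem is in the runtime setup rather than in your training script.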

Susie Sargsyan

If you are using the TPU Node architecture, you can try switching to a TPU VM. The TPU VM architecture is recommended over TPU Node: it is easier to use and faster, and you SSH directly into the machine attached to the TPU instead of going through a separate user VM. Here is the guide to create and run a TPU VM: https://cloud.google.com/tpu/docs/run-calculation-pytorch#tpu-vm
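The guide above boils down to a couple of gcloud commands; a sketch is below (the name `my-tpu-vm`, the zone, and the accelerator type are placeholder assumptions — adjust them to your project and quota):

```shell
# Create a TPU VM (not a TPU Node) with the PyTorch 2.0 runtime
gcloud compute tpus tpu-vm create my-tpu-vm \
  --zone=us-central1-b \
  --accelerator-type=v3-8 \
  --version=tpu-vm-pt-2.0

# SSH directly into the TPU VM -- no separate user VM is needed
gcloud compute tpus tpu-vm ssh my-tpu-vm --zone=us-central1-b
```

With this setup your code runs on the TPU host itself, so the "RunStep not implemented" gRPC path between a user VM and a TPU Node never comes into play.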