
I have an image which will run my training job. The training data is in a Cloud SQL database. When I run the cloud_sql_proxy on my local machine, the container can connect just fine.

❯ docker run --rm us.gcr.io/myproject/trainer:latest mysql -uroot -h"'172.17.0.2'" -e"'show databases;'"

    Running: `mysql -uroot -h'172.17.0.2' -e'show databases;'`
    Database
    information_schema
    mytrainingdatagoeshere
    mysql
    performance_schema

I'm using mysql just to test the connection; the actual training command is elsewhere in the container. When I try this via AI Platform, I can't connect.

❯ gcloud ai-platform jobs submit training firsttry3 \
  --region us-west2 \
  --master-image-uri us.gcr.io/myproject/trainer:latest \
  -- \
  mysql -uroot -h"'34.94.1.2'" -e"'show tables;'"

    Job [firsttry3] submitted successfully.
    Your job is still active. You may view the status of your job with the command

      $ gcloud ai-platform jobs describe firsttry3

    or continue streaming the logs with the command

      $ gcloud ai-platform jobs stream-logs firsttry3
    jobId: firsttry3
    state: QUEUED

❯ gcloud ai-platform jobs stream-logs firsttry3

    INFO    2019-12-16 22:58:23 -0700   service     Validating job requirements...
    INFO    2019-12-16 22:58:23 -0700   service     Job creation request has been successfully validated.
    INFO    2019-12-16 22:58:23 -0700   service     Job firsttry3 is queued.
    INFO    2019-12-16 22:58:24 -0700   service     Waiting for job to be provisioned.
    INFO    2019-12-16 22:58:26 -0700   service     Waiting for training program to start.
    ERROR   2019-12-16 22:59:32 -0700   master-replica-0        Entered Slicetool Container
    ERROR   2019-12-16 22:59:32 -0700   master-replica-0        Running: `mysql -uroot -h'34.94.1.2' -e'show tables;'`
    ERROR   2019-12-16 23:01:44 -0700   master-replica-0        ERROR 2003 (HY000): Can't connect to MySQL server on '34.94.1.2'

It seems like the host isn't accessible from wherever the job gets run. How can I grant AI Platform access to Cloud SQL?

I have considered including the Cloud SQL proxy in the training container and then injecting service account credentials as user args, but since they're both in the same project I was hoping there would be no need for this step. Are these hopes misplaced?
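In case it's relevant, the bundling I'm considering would look roughly like this entrypoint sketch (the instance connection name and port are placeholders, not my real instance):

```shell
#!/bin/sh
# entrypoint.sh sketch: start the proxy in the background, then exec the
# training command against it. Instance connection name is a placeholder.
cloud_sql_proxy -instances=myproject:us-west2:myinstance=tcp:3306 &
sleep 5      # crude wait for the proxy to start listening
exec "$@"    # the training command connects to 127.0.0.1:3306
```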

MatrixManAtYrService

2 Answers


So unfortunately, not all Cloud products get sandboxed into the same network, and you won't be able to connect automatically between products. The issue you're having is that AI Platform can't automatically reach the Cloud SQL instance at the 34.xx.x.x IP address.

There are a couple of ways you can look into fixing it, although, caveat: I don't know AI Platform's networking setup well (I'll have to do it and blog about it here soonish).

First, you can see if you can connect AI Platform to a VPC (Virtual Private Cloud) network and put your Cloud SQL instance into the same VPC. That will allow them to talk to each other over a private IP (likely different from the IP you have now). In the connection details for the Cloud SQL instance you should see whether you have a private IP; if not, you can enable it in the instance settings (requires a shutdown and restart).

Otherwise, make sure a public IP address is set up, which might be the 34.xx.x.x IP, and then allowlist (whitelist, but I'm trying to change the terminology) the Cloud IP address ranges for AI Platform.

You can read about the way GCP handles IP ranges here: https://cloud.google.com/compute/docs/ip-addresses/
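As a quick sanity check before you allowlist anything, the standard library can tell you whether a peer address falls inside a CIDR range (the addresses below are just examples, not real GCP ranges):

```python
# Check whether a peer IP falls inside a CIDR range you plan to authorize.
# Example addresses only; look up the real ranges at the link above.
import ipaddress

allowlisted = ipaddress.ip_network("34.94.0.0/16")
print(ipaddress.ip_address("34.94.1.2") in allowlisted)   # → True
print(ipaddress.ip_address("35.190.0.1") in allowlisted)  # → False
```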

Once those ranges are added to the Authorized Networks in the Cloud SQL connection settings, you should be able to connect directly from AI Platform.
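For example, a range can be added to Authorized Networks from the CLI (instance name and CIDR are placeholders; note that the flag replaces the existing list, so include any ranges you already rely on):

```shell
# Add a CIDR range to the instance's Authorized Networks (placeholders).
# --authorized-networks overwrites the current list rather than appending.
gcloud sql instances patch my-instance \
  --authorized-networks=34.94.0.0/16
```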


Original response:

Where's the proxy running when you're trying to connect to it from AI Platform? Still on your local machine? So basically, in scenario 1, you're running the container locally with docker run and connecting to your local IP, 172.17.0.2, and then when you shift up to AI Platform, you're connecting to your local machine at 34.xx.x.x?

First, you probably want to remove your actual home IP address from your original question. People are rude on the internet, and that could end badly if that's really your home IP. Second, how sure are you that you've opened a hole in your firewall to allow traffic in from AI Platform? Generally speaking, that's where I'd assume the issue is: the connection to your local machine is being refused, and the resulting error is the "can't connect".

Gabe Weiss
  • In the first example the proxy is running in my office, and I'm connecting to it from a local docker container also in the office (just to prove to myself that the container is set up correctly). In the second instance (where AI Platform is involved) I don't have a proxy running at all. I figured both sides of the connection are in the Google cloud, running because I told them to, so there should be no need for a proxy in the mix. Is that incorrect? – MatrixManAtYrService Dec 17 '19 at 16:51
  • So 34.x.x.x is the google cloud sql ip address of my instance, and 172.17.0.2 is the local address of the docker container which is running the cloud sql proxy – MatrixManAtYrService Dec 17 '19 at 16:52
  • Oooooh yeah no, that won't work. :D If you're connecting directly to the Cloud SQL instance, you need to do some more magic... like being sure either that the AI instance and your Cloud SQL instance are on the same VPC and using the private IP connection, or that the Public IP connection option on the Cloud SQL instance is enabled and you've authorized the AI platform's IP. I forget what it is, but there are blocks of IPs that it'll use, and you need to be sure to allowlist (whitelist, but more PC) the IP of the platform so it can connect directly. – Gabe Weiss Dec 17 '19 at 17:10
  • To be a bit more explicit, networking in GCP isn't as straightforward as it probably ought to be. Different components are actually isolated network-wise. When you say "AI platform" which product are you talking about? I haven't tried to do anything with pulling training data from a CloudSQL instance yet, but there's probably a blog post there somewhere. :) If that's what it is, I'll edit my answer above so it's clear what the fix is/was so you can accept it. – Gabe Weiss Dec 17 '19 at 17:11
  • To answer your question, I'm trying to use whatever product this refers to: https://cloud.google.com/ml-engine/docs/training-jobs I'm not sure why I had assumed that all of the resources in the same project would be sandboxed together and implicitly whitelisted (except that I'm new to this cloud stuff). You've helped dispel that notion for me, which is very helpful towards finding a plan (I think I'll just bundle the proxy with the training job). If you edit your answer to indicate that this automatic sandboxing doesn't occur, I'll mark it as accepted. Thank you. – MatrixManAtYrService Dec 17 '19 at 19:59
  • Yup! Edited the answer. – Gabe Weiss Dec 17 '19 at 21:49

Here's how I did it, without VPC peering or a separate proxy, entirely from my Python project.

  1. I used the Cloud SQL Connector for Python Drivers in my custom container. As an aside, I'd recommend using the connector as the default method to connect to Cloud SQL instances in your applications, as it abstracts the connection details across environments. You just need to make sure your environment has the proper Application Default Credentials to connect. No proxy required.

  2. Ran the job using a custom service account I created that includes the Cloud SQL Client role, plus a custom version of the AI Platform Service Agent role with the iam.serviceAccounts.actAs permission added, as specified here: https://cloud.google.com/ai-platform/training/docs/reference/rest/v1/projects.jobs#ReplicaConfig

  3. You can't launch jobs that use a custom service account via the UI, but you can quite easily do it programmatically, which is quicker and much more configurable. Sample code:

from googleapiclient import discovery
from googleapiclient import errors
from time import time

project_name = YOUR_PROJECT
project_id = 'projects/{}'.format(project_name)
projectNumber = 1234  # retrieved via Google Cloud SDK Shell: gcloud projects describe YOUR_PROJECT --format="value(projectNumber)"

trainingInputs = {
    "region": "us-east4",
    "masterConfig": {
        "imageUri": "gcr.io/my_project/my_image",
    },
    "serviceAccount": "****@your_project.iam.gserviceaccount.com",
}

# https://cloud.google.com/ai-platform/training/docs/reference/rest/v1/projects.jobs#ReplicaConfig
job = {
    "jobId": f"TestJob_{int(time())}",
    "labels": {
        "custom_label": "label_value",
    },
    "trainingInput": trainingInputs,
}

# https://cloud.google.com/ai-platform/training/docs/reference/rest/v1/projects.jobs/create
cloudml = discovery.build('ml', 'v1')
request = cloudml.projects().jobs().create(body=job, parent=project_id)
try:
    response = request.execute()
    # You can put your code for handling success (if any) here.
    print(response)
except errors.HttpError as err:
    print('There was an error creating the training job.'
          ' Check the details:')
    print(err._get_reason())
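For reference, step 1's connector usage looks roughly like this; the instance connection name, user, password, and database are placeholders, not values from my project:

```python
# Sketch using the Cloud SQL Python Connector
# (pip install "cloud-sql-python-connector[pymysql]").
# Instance connection name, user, password, and db are placeholders.
from google.cloud.sql.connector import Connector

connector = Connector()  # uses Application Default Credentials

conn = connector.connect(
    "my_project:us-east4:my-instance",  # instance connection name
    "pymysql",
    user="trainer",
    password="...",
    db="mytrainingdatagoeshere",
)
with conn.cursor() as cursor:
    cursor.execute("SHOW TABLES")
    print(cursor.fetchall())
conn.close()
connector.close()
```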
FakeFootball