I've made an ACS instance.

az acs create --orchestrator-type=kubernetes \
          --resource-group $group \
          --name $k8s_name \
          --dns-prefix $kubernetes_server \
          --generate-ssh-keys

az acs kubernetes get-credentials --resource-group $group --name $k8s_name

I then ran helm init, which provisioned the tiller pod fine. Next I ran helm install stable/redis and got a redis deployment up and running (seemingly).

I can kubectl exec -it into the redis pod and can see it's binding on 0.0.0.0. I can log in with redis-cli -h localhost and redis-cli -h <pod_ip>, but not with redis-cli -h <service_ip> (the IP from kubectl get svc).

If I run up another pod (which is how I ran into this issue) I can ping redis.default and it shows the DNS resolving to the correct service IP but gives no response. When I telnet <service_ip> 6379 or redis-cli -h <service_ip> it hangs indefinitely.

I'm at a bit of a loss as to how to debug further. I can't ssh into the node to see what docker is doing.

Also, I'd initially tried this with a standard Alpine Redis image, so the helm chart was a fallback. I tried it yesterday and the helm one worked, but the manual one didn't. Today (on a newly built ACS cluster) neither of them works.

I'm going to spin up the cluster again to see if it reproduces reliably, but I'm pretty confident something fishy is going on.

PS - I have a VNet with an overlapping subnet (10.0.0.0/16) in a different region. When I go into the address range I get a warning that there is a clash. Could that affect it?

<EDIT>

Some new insight... It's something to do with alpine based images (which we've been aiming to use)...

If I kubectl run a --image=nginx (which is Debian-based), I can shell in, install telnet and connect to the redis service.

But if I, e.g., kubectl run c --image=rlesouef/alpine-redis and then shell in, telnet to the same redis service doesn't work.

</EDIT>

Chris
    Unless you're peering them, you can ignore the warning you mention in your PS; not sure why the Portal is so aggressive about warning on this... – colemickens Apr 28 '17 at 17:48

1 Answer

There was a similar issue, https://github.com/Azure/acs-engine/issues/539, that has been fixed recently. One thing to verify is whether nslookup works in the container.
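A minimal sketch of that nslookup check, run from outside the pod with kubectl exec. The function name check_service_dns is hypothetical, and it assumes the container image ships an nslookup binary whose exit status reflects whether the lookup succeeded (some minimal images may behave differently):

```shell
#!/bin/sh
# check_service_dns POD SERVICE   (hypothetical helper, not part of kubectl)
# Runs nslookup for SERVICE inside POD and reports whether it resolved.
# Assumption: nslookup exists in the container and exits non-zero on failure.
check_service_dns() {
    pod="$1"
    service="$2"
    if kubectl exec "$pod" -- nslookup "$service" >/dev/null 2>&1; then
        echo "dns ok: $service resolves from $pod"
    else
        echo "dns failed: $service does not resolve from $pod"
        return 1
    fi
}

# Example (look up the real pod name with `kubectl get pods` first):
#   check_service_dns <pod-name> redis.default
```

Note that a successful lookup only shows that DNS works from that container; it says nothing about whether traffic to the service IP is actually being forwarded.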

jiangtli
  • I had assumed that ping resolving the service IP from the service name meant DNS was OK. Is this issue suggesting it should be the pod IP that gets resolved? What extra information does nslookup provide? – Chris Apr 28 '17 at 17:59
  • Now that I've reproduced it on alpine, I can confirm the behaviour exists both when the containers are on the same node and when they are on different nodes, so I don't think this issue is the answer. – Chris Apr 28 '17 at 21:33
  • Issue #539 fixes connectivity between containers. You can confirm you are affected if the file `/etc/systemd/system/docker.service.d/exec_start` contains `--iptables=false --ip-masq=false` – A Howe May 04 '17 at 15:24
  • Also, what is alpine? Two things I noticed when I ran `helm install stable/redis`: 1. The redis container took 4 minutes to get started. 2. I needed to extract the password from the pod (`REDIS_PASSWORD=$(kubectl get secret --namespace default laughing-bee-redis -o jsonpath="{.data.redis-password}" | base64 --decode)`). Once I confirmed the pod was running, and used the password, I was able to connect to the redis service. – A Howe May 04 '17 at 15:29
  • In the above I notice you are using telnet. Did you try typing any commands into the telnet window? Can you try redis-cli instead? Here is what I do after shelling into an nginx container: `apt-get update; apt-get install redis-tools`. Then I run `redis-cli -h laughing-bee-redis -a $PASSWORD`, and I can connect and use the commands. – A Howe May 04 '17 at 15:35
  • Alpine is a minimal Linux distribution for Docker. I've just tested and it's currently working on an alpine-pause image... I'm going to revisit tomorrow and confirm that it's now working on our actual setup. I was running "INFO" with telnet and it was giving password errors, which means connectivity is good. I wasn't getting that before. – Chris May 04 '17 at 16:35
  • Checked again today and it's now working for my real setup. I don't know what has changed in a week, and I can't explain it. Shall I delete this question, as I don't think there is anything useful in it for anyone else? – Chris May 05 '17 at 09:57
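
For anyone who does have shell access to a node, the check for issue #539 described in the comments above could be scripted roughly as follows. This is only a sketch: the function name check_docker_flags is hypothetical, and it simply looks for the two flags quoted in the comment in whatever file it is given.

```shell
#!/bin/sh
# check_docker_flags FILE   (hypothetical helper)
# Prints "affected" if FILE contains both --iptables=false and
# --ip-masq=false, "not affected" if it does not, and returns 2
# if FILE cannot be read.
check_docker_flags() {
    file="$1"
    if [ ! -r "$file" ]; then
        echo "cannot read $file"
        return 2
    fi
    # -e is needed so grep does not treat the leading "--" as an option.
    if grep -q -e '--iptables=false' "$file" &&
       grep -q -e '--ip-masq=false' "$file"; then
        echo "affected"
    else
        echo "not affected"
    fi
}

# Example, on a cluster node:
#   check_docker_flags /etc/systemd/system/docker.service.d/exec_start
```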