Google c2dm transient 401 errors on some AWS instances

Question

How can we figure out why we're getting occasional 401 errors from the Google c2dm push service on some AWS instances when asking c2dm to deliver a notification?

This is a transient problem. All AWS instances are mostly successful in sending an HTTPS request to Google c2dm, some instances are successful 100% of the time, and some some get 401's occasionally. So we do not believe that this is an issue with our c2dm registration or our notification code (python) that has been in production for over a year. The 401 errors started on May 16, 2012.

Instead, we're thinking that something in the Amazon infrastructure, including DNS caching, may somehow be involved in the problem. Google kindly replied to our inquiry saying:

I'd look for something that could cause flaky communications. Try and see if you're getting unusual numbers of corrupted or dropped packets on that machine's network adapter.

However, we do not see any evidence of "flaky communications". The cpu load on the instances is nearly 0 when the problems occur, and the number of ethernet connections on troublesome machines is, on average, lower than on instances which have no problems.

One clue is that the 401 errors seem to occur in a clump (several happening within about 4 minutes of each other), and that clumps are often spaced by 10 to 60 minutes apart (though there can be many hours without errors). We don't see I/O errors or "flaky communications" errors, just 401 errors from Google c2dm.

A serverfault post led us to think about DNS caching on AWS as it relates to the SSL validation of the hostname in the certificate offered up by the Google c2dm service, but it seems that the python 2.7 urllib2 that we use does not validate the host by default.

Another clue is that we changed the IP address of the first web instance that showed the problem, using the "elastic IP" feature: same, continuously-running instance, just with a new IP. That instance became 100% successful for 4 days, but has since returned to having occasional 401s.

What can we do that will shed light on this?

A stack trace sample:

c2dm push error: HTTP Error 401: Unauthorized 
Traceback (most recent call last):
   File "/home/django/base/src/mmsite/push/models.py", 
line 262, in send_c2dm_message     
   response = urllib2.urlopen(request) # third try
   File "/usr/local/lib/python2.7/urllib2.py",
 line 126, in urlopen     
    return _opener.open(url, data, timeout)
   File "/usr/local/lib/python2.7/urllib2.py",
 line 400, in open     
    response = meth(req, response)
   File "/usr/local/lib/python2.7/urllib2.py",
 line 513, in http_response     'http', request, response, code, msg, hdrs)   
   File "/usr/local/lib/python2.7/urllib2.py",
 line 438, in error
     return self._call_chain(*args)   
     File "/usr/local/lib/python2.7/urllib2.py", 
line 372, in _call_chain
     result = func(*args)
   File "/usr/local/lib/python2.7/urllib2.py", 
   line 521, in http_error_default
     raise HTTPError(req.get_full_url(), code, msg, hdrs, fp)
 HTTPError: HTTP Error 401: Unauthorized

Sample HTTP headers returned in 401 reply:

'headers': [
    'Content-Type: text/html; charset=UTF-8\r\n', 'Date: Fri, 25 May 2012 00:24:25 GMT\r\n', 
    'Expires: Fri, 25 May 2012 00:24:25 GMT\r\n', 'Cache-Control: private, max-age=0\r\n', 
    'X-Content-Type-Options: nosniff\r\n', 'X-Frame-Options: SAMEORIGIN\r\n', 
    'X-XSS-Protection: 1; mode=block\r\n', 'Server: GSE\r\n', 'Connection: close\r\n']

Edit for additional test info:

We were able to reproduce this transient 401 on a development network. Sometimes it worked, sometimes it got a 401. Since the development network is completely separate from AWS, this removes all the variables we were considering about AWS, and gives weight to the theory that the issue is on the Google side. Google kindly replied that they would escalate the issue.

Can someone w/ 300+ reputation create a "c2dm" tag and apply it? — larham1, May 24 '12 at 22:41
Have you verified that the 401 response is actually from Google and not from urllib2 indicating an SSL error? If so, and assuming that your application doesn't have bugs causing incorrect authentication information in the requests, this sounds like an issue on Google's end. — mgorven, May 24 '12 at 22:52
@mgorven thanks for the tag and the question. The call that throws is: response = urllib2.urlopen(request) . AFAICT, urllib2 can throw exceptions for bad args, etc. but would never throw a 401 for something internal. Google's reply to us makes us worried about SSL handshake timeouts, etc. in AWS, but I agree that it could be on their side, and we've asked again. — larham1, May 24 '12 at 23:30
Yeah, all SSL issues should be raised as ssl.SSLError, so this is actually Google (or whatever's on the other side) returning a 401 HTTP response. I doubt that it's a timeout or networking issue. Do you have the HTTP body of the 401 response? I see reference to C2DM messages being limited to 200000 per day -- are you near that volume? — mgorven, May 24 '12 at 23:44
@mgorven Good idea, we just added logging to capture the body... of course, when you need the error, you don't get one :) We've gotten an over-limit error before (per device limit), and it wasn't a 401 but rather adhered to their docs' protocol https://developers.google.com/android/c2dm/#push . We're under 100K. We asked for an increased limit a while ago. And most requests continue to flow ok... — larham1, May 25 '12 at 00:14
Update 29 June 2012: at I/O office hours, a c2dm fellow suggested that the blog entry http://android-developers.blogspot.com/2012/04/android-c2dm-client-login-key.html suggests the authentication keys are going to be expiring, and although we did not see the tell-tale header associated with our 401 responses, it seemed wise to reset the auth key. (We didn't want to do this without a good reason.) We reset it today and are monitoring... — larham1, Jun 30 '12 at 01:08

score 2 · Accepted Answer · answered Jul 07 '12 at 21:24

This was fixed by changing the the auth key (an automated process via a URL request to the c2dm service to get the new key, then putting it immediately into our server-side push code). We were reluctant to do this while the key was working fine for most c2dm pushes, but it looks like some Google servers were unhappy with the key, causing intermittent errors for us. We've been error-free since changing it over a week ago.

Google c2dm transient 401 errors on some AWS instances

1 Answers1