How can we figure out why we're getting occasional 401 errors from the Google c2dm push service on some AWS instances when asking c2dm to deliver a notification?
This is a transient problem. All AWS instances are mostly successful in sending HTTPS requests to Google c2dm: some instances succeed 100% of the time, and some get occasional 401s. Because of this, we do not believe the problem lies in our c2dm registration or in our notification code (Python), which has been in production for over a year. The 401 errors started on May 16, 2012.
Instead, we're thinking that something in the Amazon infrastructure, including DNS caching, may somehow be involved in the problem. Google kindly replied to our inquiry saying:
I'd look for something that could cause flaky communications. Try and see if you're getting unusual numbers of corrupted or dropped packets on that machine's network adapter.
However, we do not see any evidence of "flaky communications". The CPU load on the instances is nearly 0 when the problems occur, and the number of open network connections on the troublesome machines is, on average, lower than on the instances that have no problems.
One clue is that the 401 errors seem to occur in clumps (several within about 4 minutes of each other), with clumps often spaced 10 to 60 minutes apart (though there can be many hours without errors). We don't see I/O errors or other signs of "flaky communications", just 401s from Google c2dm.
A Server Fault post led us to think about DNS caching on AWS as it relates to SSL validation of the hostname in the certificate presented by the Google c2dm service, but the Python 2.7 urllib2 that we use does not appear to validate the certificate or hostname by default.
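To test the DNS-caching theory, we plan to log which addresses the c2dm host name resolves to at the moment of each push, so that a later 401 can be matched against the address that was in use. A rough sketch (assuming the standard android.apis.google.com endpoint; adjust C2DM_HOST if yours differs):

import socket
import logging

C2DM_HOST = 'android.apis.google.com'

def log_resolved_ips():
    """Log every address the resolver currently returns for the c2dm host,
    so a later 401 can be correlated with the address that was in use."""
    try:
        infos = socket.getaddrinfo(C2DM_HOST, 443, socket.AF_INET, socket.SOCK_STREAM)
        ips = sorted(set(info[4][0] for info in infos))
        logging.info('c2dm %s resolves to %s', C2DM_HOST, ', '.join(ips))
    except socket.gaierror as exc:
        logging.warning('c2dm DNS lookup failed: %s', exc)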
Another clue is that we changed the IP address of the first web instance that showed the problem, using the Elastic IP feature: the same continuously running instance, just with a new IP. That instance became 100% successful for 4 days, but has since returned to having occasional 401s.
What can we do that will shed light on this?
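One step we are adding ourselves is more logging around the push call, so that every 401 keeps its full response (headers and body), and so that successful replies are checked for the Update-Client-Auth header, which, as we read the c2dm docs, carries a replacement ClientLogin token. A rough sketch of what we have in mind (send_push, AUTH_TOKEN and the payload fields are placeholders, not our real code):

import urllib
import urllib2
import logging

C2DM_URL = 'https://android.apis.google.com/c2dm/send'
AUTH_TOKEN = 'our ClientLogin token'  # placeholder

def send_push(registration_id, payload):
    data = urllib.urlencode({
        'registration_id': registration_id,
        'collapse_key': 'update',
        'data.payload': payload,
    })
    request = urllib2.Request(C2DM_URL, data)
    request.add_header('Authorization', 'GoogleLogin auth=%s' % AUTH_TOKEN)
    try:
        response = urllib2.urlopen(request)
    except urllib2.HTTPError as err:
        # Keep the whole reply: the body sometimes says more than the status line.
        logging.error('c2dm push failed: %s\nheaders: %s\nbody: %s',
                      err, err.info(), err.read())
        raise
    # If Google rotates the ClientLogin token, it is (per the c2dm docs as we
    # read them) returned in Update-Client-Auth; store it for the next send.
    new_token = response.info().getheader('Update-Client-Auth')
    if new_token:
        logging.info('c2dm issued a new auth token; updating stored token')
        # save_new_token(new_token)  # hypothetical persistence hook
    return response.read()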
A stack trace sample:
c2dm push error: HTTP Error 401: Unauthorized
Traceback (most recent call last):
  File "/home/django/base/src/mmsite/push/models.py", line 262, in send_c2dm_message
    response = urllib2.urlopen(request) # third try
  File "/usr/local/lib/python2.7/urllib2.py", line 126, in urlopen
    return _opener.open(url, data, timeout)
  File "/usr/local/lib/python2.7/urllib2.py", line 400, in open
    response = meth(req, response)
  File "/usr/local/lib/python2.7/urllib2.py", line 513, in http_response
    'http', request, response, code, msg, hdrs)
  File "/usr/local/lib/python2.7/urllib2.py", line 438, in error
    return self._call_chain(*args)
  File "/usr/local/lib/python2.7/urllib2.py", line 372, in _call_chain
    result = func(*args)
  File "/usr/local/lib/python2.7/urllib2.py", line 521, in http_error_default
    raise HTTPError(req.get_full_url(), code, msg, hdrs, fp)
HTTPError: HTTP Error 401: Unauthorized
Sample HTTP headers returned in the 401 reply:
'headers': [
    'Content-Type: text/html; charset=UTF-8\r\n',
    'Date: Fri, 25 May 2012 00:24:25 GMT\r\n',
    'Expires: Fri, 25 May 2012 00:24:25 GMT\r\n',
    'Cache-Control: private, max-age=0\r\n',
    'X-Content-Type-Options: nosniff\r\n',
    'X-Frame-Options: SAMEORIGIN\r\n',
    'X-XSS-Protection: 1; mode=block\r\n',
    'Server: GSE\r\n',
    'Connection: close\r\n',
]
Edit for additional test info:
We were able to reproduce this transient 401 on a development network: sometimes the request succeeded, sometimes it got a 401. Since the development network is completely separate from AWS, this eliminates the AWS-specific variables we had been considering and gives weight to the theory that the issue is on the Google side. Google kindly replied that they would escalate the issue.