
I am running a Flask/Gunicorn Python app on a Heroku Cedar dyno. The app returns JSON responses to its clients (it's an API server, really).

Once in a while clients get 0-byte responses. It's not me returning them, however. Here is a snippet of my app's log:

Mar 14 13:13:31 d.0b1adf0a-0597-4f5c-8901-dfe7cda9bce0 app[web.1] [2013-03-14 13:13:31 UTC] 10.104.41.136 apisrv - api_get_credits_balance(): session_token=[MASKED]

The first line above is me starting to handle the request.

Mar 14 13:13:31 d.0b1adf0a-0597-4f5c-8901-dfe7cda9bce0 app[web.1] [2013-03-14 13:13:31 UTC] 10.104.41.136 apisrv 1252148511 api_get_credits_balance(): returning [{'credits_balance': 0}]

The second line is me returning a value (to Flask -- it's a Flask "Response" object).

Mar 14 13:13:31 d.0b1adf0a-0597-4f5c-8901-dfe7cda9bce0 app[web.1] "10.104.41.136 - - [14/Mar/2013:13:13:31] "POST /get_credits_balance?session_token=MASKED HTTP/1.1" 200 22 "-" "Appcelerator Titanium/3.0.0.GA (iPhone/6.1.2; iPhone OS; en_US;)"

The third line is Gunicorn's, where you can see that Gunicorn returned a 200 status and a 22-byte HTTP body ("200 22").

However, the client got 0 bytes. Here is the Heroku router log:

Mar 14 13:13:30 d.0b1adf0a-0597-4f5c-8901-dfe7cda9bce0 heroku[router] at=info method=POST path=/get_credits_balance?session_token=MASKED host=matchspot-apisrv.herokuapp.com fwd="66.87.116.128" dyno=web.1 queue=0 wait=0ms connect=1ms service=19ms status=200 bytes=0

Why does Gunicorn return 22 bytes, but Heroku sees 0, and indeed passes back 0 bytes to the client? Is this a Heroku bug?

  • Did you notice that the Heroku timestamp is before your process timestamp? Do you use gevent? I think something is wrong with synchronization. – Tigra Mar 23 '13 at 18:34
  • Thanks for commenting. I use the regular Gunicorn sync workers, not gevent nor any other greenlet. Interesting observation regarding the timestamp, but my hypothesis would be that the timestamp written by Heroku is the timestamp at which the router *got* the request. Just a guess. This explains the 1sec difference in timestamps (the real difference of course could be 1ms, the 1ms that rolled over the second count from 30 to 31). At this point I should add that since posting this question I've enabled a new Heroku-labs feature that prints a unique "request id" in Heroku's router logs, (cont'd) – Nitzan Shaked Mar 23 '13 at 19:37
  • (cont'd) and passes that as an HTTP header to my worker, so I can correlate the lines. The lines *do* correlate when the phenomenon happens. Also, I've added a gunicorn post-request hook that prints the response, and indeed gunicorn claims to have written and flushed all the response bytes. – Nitzan Shaked Mar 23 '13 at 19:39
  • And yet, the timestamps show a 1-second difference, not 1ms... I haven't worked with Heroku, so these are only suggestions. Both 1ms and 1999ms can produce a 1-second difference in timestamps. A service time of 19ms also seems too low to be true on a cloud service. So my point is that there is probably some kind of timeout, and on timeout Heroku serves an empty page instead of an error. This suggestion is a long shot, but maybe you should emulate a long request and see what happens. – Tigra Mar 23 '13 at 21:45
  • How helpful was Heroku when you contacted them with this (out of curiosity)? – orokusaki Mar 27 '13 at 04:56
  • Not very much so far. I approached them 10 days ago, and was told the Python guys would look at it first, and if they can't help me then the routing guys will have a look. 5 days later I was informed that the Python guys had passed this to the routing guys, and today I got an email from a "routing guy" saying he could not recreate it and asking for some more info. So yes, they are going through the right process, but it's taking forever. – Nitzan Shaked Mar 27 '13 at 08:39
  • The 22 bytes could be the headers; the body can be 0 bytes. – Raptor Apr 08 '13 at 06:04
  • @ShivanRaptor gunicorn logs `Content-Length`, which is body size only. – robertklep Apr 08 '13 at 06:11
  • When communicating with Heroku support, quote the following to them: "With Heroku, you don't need to think about servers at all." It's what it says on the website, and the reason why they have any customers at all. – WonderWorker Apr 18 '13 at 11:41
  • Small update: this hasn't yet been resolved. I've been corresponding back and forth with Heroku support, and the best I can gather right now is that they haven't dismissed me with "it's on your end", and are trying to write a tool that will tcpdump-capture app traffic, for "debugging cases like this". – Nitzan Shaked Apr 23 '13 at 16:28
  • @NitzanShaked: good to hear that. Please keep us informed, it's an interesting case of how Heroku responds to difficult problems. – liori Apr 26 '13 at 17:27
  • Hmm if you are prepared to spend this much time debugging the issue why not just get rid of Heroku and spool up some EC2 instances directly? Considering how easy it is to interact with AWS I can never seem to justify using Heroku for any reason! – David K Apr 29 '13 at 02:32
  • It's a long shot, but is it possible that null characters are involved? Some browsers/web-access-programs used to stop on null characters even if there was more data. –  Apr 30 '13 at 06:19
  • @BarryCarter I don't think that's it. First, there are no null characters involved. Second, it's not that the browser is seeing 0 bytes, it's that the *Heroku router* "sees" me returning 0 bytes, even though I am not (as proven by the Gunicorn logs). – Nitzan Shaked Apr 30 '13 at 07:34
  • Check content-type and character encoding. I've seen proxies choke on both of those. – Devon_C_Miller Apr 30 '13 at 13:53
  • Is it possible that your application returns both 'Content-Length: 22' and 'Transfer-Encoding: chunked' headers? This would be an incorrect response according to the HTTP standard, because the two headers are mutually exclusive, and it could possibly cause the behavior you are observing. Gunicorn could log message length based on Content-Length, and the Heroku router could be trying to parse a chunk length from the body and failing, thus reporting a length of 0. – Jan Wrobel Apr 30 '13 at 17:44
  • Devon_C_Miller, JanWrobel -- both are good suggestions, but unfortunately I suspect that's not the case. If it were, I would expect to see this behavior from Heroku all the time. However, it only happens in about 1 in 1000 requests. My app returns the same responses in all 1000 cases, so... – Nitzan Shaked Apr 30 '13 at 19:26
  • maybe this can help: http://stackoverflow.com/questions/6458947/rails-3-1-heroku-cedar-static-image-assets-are-not-being-served – Dror Bereznitsky May 02 '13 at 13:04
  • @drorb -- thanks, but that's not it. For one, it happens on some occasions, and not always. But really, X-Sendfile is for static files, while I am serving dynamic results, and in any case do not include the X-Accel nor X-Sendfile headers in my response. – Nitzan Shaked May 03 '13 at 19:24
  • Maybe the client shouldn't give up on receiving the answer if it gets 0 bytes from the TCP channel on the first try. It should keep reading until it receives the expected number of bytes (22) or the TCP channel is closed. This is a different exit criterion. – Csaba Toth May 07 '13 at 18:06
  • @CsabaToth thanks for the suggestion, but the Heroku router sends the client an empty body and then closes the socket: the client can't just continue reading, as the whole response is never sent. – Nitzan Shaked May 08 '13 at 12:50
  • @Nitzan Shaked This is very important: are you sure that the Heroku router closes the connection, and not just that the software reading on the other end of the pipe bails out because it sees 0 bytes? I'm not saying that it's good to see 0 bytes, but when info is sent through the network, both ends have the whole network stack between them, not to mention a whole bunch of network equipment (routers, transfer links, etc.). So when Heroku sends 22 bytes, there's no guarantee it'll arrive in one piece, for example, at the other side of the same (app) level. – Csaba Toth May 08 '13 at 16:26
  • @Nitzan Shaked cont. - maybe it'll arrive in two 11-byte pieces. I don't know how much control you have over the software which reads the pipe on "your" side, but the low-level code could maybe be worked around. It's important, both on Heroku's side and the other side, what the exit conditions of the read/write loops are. I'm talking about "low level" network-handling code. – Csaba Toth May 08 '13 at 16:30
  • @CsabaToth -- I am sure. Heroku closes the connection. Not only that, but Heroku *itself* prints "0 bytes" in its logs. – Nitzan Shaked May 09 '13 at 04:35
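One of the comments above mentions adding a Gunicorn post-request hook to confirm that the worker really wrote and flushed the response bytes. As a hedged sketch of that idea: `post_request` is a standard Gunicorn server hook placed in the config file, but the header names, attribute fallbacks, and log format below are illustrative assumptions, not the asker's actual code.

```python
# gunicorn.conf.py -- sketch of a post-request logging hook.

import logging

def describe_response(environ, resp):
    # Build the log line separately so it can be unit-tested.
    # HTTP_X_REQUEST_ID assumes the Heroku-labs request-id header
    # mentioned in the comments is being forwarded to the worker.
    request_id = environ.get("HTTP_X_REQUEST_ID", "-")
    return "request_id=%s %s %s -> status=%s sent=%s" % (
        request_id,
        environ.get("REQUEST_METHOD", "-"),
        environ.get("PATH_INFO", "-"),
        getattr(resp, "status", "-"),   # e.g. "200 OK"
        getattr(resp, "sent", "-"),     # bytes the worker actually wrote
    )

def post_request(worker, req, environ, resp):
    # Gunicorn calls this in the worker after the response has been
    # written, so a byte count logged here was handed to the socket.
    logging.getLogger("apisrv").info(describe_response(environ, resp))
```

Correlating this line with the router's `bytes=` field (via the shared request id) is what lets the asker show the 0-byte truncation happens downstream of Gunicorn.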

1 Answer


I know I may be considered a little off the wall here, but there is another option.

We know that from time to time a bug occurs in transit, and that there is not much we can do right now to stop it. If you are only providing the API, then stop reading; however, if you write the client too, keep going.

The error is a known case with a known cause. An empty return value means that something went wrong, even though the value was available and was fetched, calculated, whatever... My instinct as a developer would be to treat an empty result as an HTTP error and request that the data be resent. You could then track the resend requests and see how often this happens.

I would suggest (although you strike me as the kind of developer who would think of this too) that you count the retries and pick a sane limit before reporting a "network error" to the user. My instinct would be to retry right away and then wait a little while before retrying some more.

From what you describe, the first retry would probably pick up the data properly. Of course this could mean keeping older results in a cache for a few minutes, or running the request a second time, depending on what seems most appropriate.
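The retry policy described above (one immediate retry, then short waits before further attempts) can be sketched as a small client-side helper. The function name, attempt count, and timings below are illustrative assumptions, not part of any API mentioned in the question:

```python
import time

def fetch_with_retry(do_request, max_attempts=4, backoff_seconds=0.5):
    """Call do_request() until it returns a non-empty body.

    do_request is expected to return the response body as bytes (or raise).
    An empty body is treated as a transient network error: retry once
    immediately, then wait a little between further attempts.
    """
    for attempt in range(max_attempts):
        if attempt >= 2:
            # The first retry is immediate; later retries back off.
            time.sleep(backoff_seconds * (attempt - 1))
        body = do_request()
        if body:  # non-empty response: success
            return body
    raise IOError("empty response after %d attempts" % max_attempts)
```

Note that for the API in the question the requests have side effects, so blind retries are only safe if the server deduplicates them (see the request-id discussion in the comments below).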

This would also route around any number of other point-to-point networking errors and leave the app far more robust even in the face of connectivity problems.

I know our instinct as developers is to fix the known fault, but sometimes it is better to work towards a system that can operate despite faults. That said, it never hurts to log errors and problems and try to fix them anyway.

  • Actually that's not a bad comment (though probably should be in a comment and not in an answer), and don't think I haven't thought about it... The problem is that the client can't issue the request again, because the request may have server-side side-effects (like transferring money a second time, say). The solution for that is for the client to issue request_id's, and for the server to keep a list of "which request_id's have been served in the last 60sec". When a client gets a 200 response with 0 bytes body it re-issues the request with the same id, and the server doesn't re-perform (con't) – Nitzan Shaked May 27 '13 at 16:31
  • (con't) the whole thing again. However, that's so ugly that I chose not to implement it. – Nitzan Shaked May 27 '13 at 16:32
  • I'm barely a beginner at caching, but it seems to me: Send a random string as part of the request, and cache the result. When you resend the request, with the same random string, you'll naturally fetch the cached result (same content, same source...), but when you send a legit new request, you have a new random string and thus no cached result. – Narfanator Jun 01 '13 at 07:38
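The request-id deduplication scheme sketched in these comments could look roughly like the following on the server side. Everything here is hypothetical (the asker explicitly chose not to implement it): the function names are invented, and a real deployment with multiple dynos/workers would need shared storage such as Redis rather than an in-process dict.

```python
import time

# Hypothetical in-memory cache of recently served request ids.
# request_id -> (timestamp, cached response body)
_served = {}
_TTL = 60.0  # remember responses for 60 seconds, as the comment suggests

def handle_idempotent(request_id, perform):
    """Run perform() at most once per request_id within the TTL.

    perform() does the side-effecting work (e.g. transferring money)
    and returns the response body. A client that received a 0-byte body
    re-sends the same request_id and gets the cached result back instead
    of triggering the side effect a second time.
    """
    now = time.time()
    # Evict expired entries (collect keys first to avoid mutating
    # the dict while iterating over it).
    for rid in [r for r, (ts, _) in _served.items() if now - ts > _TTL]:
        del _served[rid]
    if request_id in _served:
        return _served[request_id][1]  # replay the cached response
    body = perform()
    _served[request_id] = (now, body)
    return body
```

This is essentially the "random string per request" idea from the last comment: a fresh id triggers the work, a repeated id replays the cached result.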