I run a few hosts on network A that make requests to servers (which I don't own) on network B, somewhere across the Internet. Unfortunately, many of these requests get corrupted. If I make the requests over unencrypted HTTP, I get strange errors that hint at a corrupt request. If I make the requests over HTTPS, I get SSL-level errors. I can reproduce the problem by running:
sh -e -c 'while true; do curl $SERVER > /dev/null; sleep 1; done'
Usually within 20 requests, curl fails with an error like "Unknown SSL protocol error" or "tlsv1 alert decrypt error". I can reproduce this on multiple hosts in network A, accessing multiple servers on network B. But I cannot reproduce from network A to other servers, or from other hosts to network B. In those cases, the loop runs forever with no errors.
So it's pretty clear my TCP stream is getting corrupted between A and B. This has been going on for over 3 days, by the way.
First question: How can this plausibly happen? TCP has packet-level checksums, and corrupt packets passing the checksum should be much rarer than I am seeing. Also, if I run a network capture, I don't see many retransmits (according to wireshark's tcp.analysis.retransmit filter), which you would expect if packets were being corrupted and failing the TCP checksum. I guess some router must be doing higher-level data mangling (NAT? transparent proxy?) and corrupting the data but fixing the checksum?
Second question: Are there any tools I can use to isolate the problem? I can't find any. If I knew the network topology and I could find HTTPS servers behind each hop between A and B, I could run my test on them. But I don't. What other test would show up network corruption?
I've contacted the owners of network A and network B, but they haven't been helpful so far.
Update: To anyone suggesting what kind of buggy device might be in the path, is there any way to detect this other than contacting the owner?