I want to retrieve the HTML of a webpage given its address.

var url = "https://www.stackoverflow.com/questions";
var uri = new Uri(url);
var host = uri.Host;
client.Connect(host, 443);
using SslStream sslStream = new SslStream(client.GetStream(), 
    false,
    new RemoteCertificateValidationCallback(ValidateServerCertificate), 
    null
);

var message = @$"GET {uri.AbsolutePath} HTTP/1.1
Accept: text / html, charset = utf - 8
Connection: close
Host: {host}
" + "\r\n\r\n";
sslStream.AuthenticateAsClient(host);
using var reader = new StreamReader(sslStream, Encoding.UTF8);
byte[] bytes = Encoding.UTF8.GetBytes(message);
sslStream.Write(bytes, 0, bytes.Length);
var response = reader.ReadToEnd();

public static bool ValidateServerCertificate(
    object sender, 
    X509Certificate certificate,
    X509Chain chain, 
    SslPolicyErrors sslPolicyErrors)
{
    return true;
}

This code is very inconsistent: depending on the site, I can receive 302, 301, 403, or 200.
I would like to understand what is causing this inconsistency and how it could be fixed.

IOEnthusiast
  • 2
    Why are you using TcpClient? What about HttpClient? – mtkachenko Mar 20 '23 at 07:25
  • HttpClient is not allowed – IOEnthusiast Mar 20 '23 at 07:25
  • 1
    The error codes you mention are HTTP errors, not TCP errors. We'd need to know about the URLs you're trying to reach: what kind of requests do they require? Also the limitation of no HttpClient is super weird. If you're on .NET Framework then can you use HttpWebRequest instead? – Simmetric Mar 20 '23 at 08:12
  • 1
    I'd also recommend to test with `https://httpbin.org/get` – Falco Alexander Mar 20 '23 at 08:15
  • @IOEnthusiast what does `HttpClient is not allowed` even mean? Why? It matters. Without an actual explanation (there won't be one) the question should be closed. SO is about technical, not philosophical or political questions – Panagiotis Kanavos Mar 21 '23 at 08:58
  • Is this homework? Homework questions are acceptable provided they're clearly marked and clearly explain the problem. – Panagiotis Kanavos Mar 21 '23 at 09:00
  • It is, I wasn't told why httpclient is not allowed. I was wondering the same – IOEnthusiast Mar 21 '23 at 09:06
  • Since it's homework, the reason is that you need to get acquainted with how HTTP requests actually work, what they look like, and inspect the actual HTTP responses before HttpClient, HttpWebRequest or any other HTTP library parses them. When you get a 302 status code you also get a `Location` header with the URL you should try next for example – Panagiotis Kanavos Mar 21 '23 at 09:07

1 Answer

-1
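// HTTP/1.1 requires CRLF line endings, so spell them out instead of
// relying on the line endings of a verbatim string in the source file.
var message =
    $"GET {uri.AbsolutePath} HTTP/1.1\r\n" +
    "Accept: text/html, charset=utf-8\r\n" +
    "Connection: close\r\n" +
    "User-Agent: C# program\r\n" +
    $"Host: {host}\r\n" +
    "\r\n"; // a blank line ends the header block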

A User-Agent header was required for websites like Facebook and Instagram, which otherwise responded with a 302 redirect to an "unsupported browser" page.

301 - appeared because not every website is served from the www subdomain, so the server permanently redirects to its canonical host.

403/401 - the most obvious case: some resources just aren't available unless you're authenticated.
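
To see which of these cases applies, it helps to read the status line and the Location header out of the raw response instead of treating the whole string as HTML. The following is only a minimal sketch: `RawHttp`, `ParseHead` and `GetRedirectTarget` are hypothetical names that are not part of the original code, and it assumes `response` is the full text returned by `reader.ReadToEnd()`.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

// Hypothetical helper for illustration; not part of the original post.
public static class RawHttp
{
    // Parses the status line and headers from a raw HTTP/1.1 response.
    public static (int StatusCode, Dictionary<string, string> Headers) ParseHead(string response)
    {
        using var reader = new StringReader(response);

        // The status line looks like: HTTP/1.1 302 Found
        string statusLine = reader.ReadLine() ?? throw new FormatException("Empty response");
        string[] parts = statusLine.Split(' ', 3);
        if (parts.Length < 2 || !int.TryParse(parts[1], out int status))
            throw new FormatException($"Bad status line: {statusLine}");

        // Header names are case-insensitive.
        var headers = new Dictionary<string, string>(StringComparer.OrdinalIgnoreCase);
        string? line;
        // Headers end at the first empty line; the body (if any) follows.
        while ((line = reader.ReadLine()) != null && line.Length > 0)
        {
            int colon = line.IndexOf(':');
            if (colon > 0)
                headers[line[..colon].Trim()] = line[(colon + 1)..].Trim();
        }
        return (status, headers);
    }

    // Returns the URL to request next if the response is a redirect, otherwise null.
    public static Uri? GetRedirectTarget(Uri requestUri, string response)
    {
        var (status, headers) = ParseHead(response);
        bool isRedirect = status is 301 or 302 or 303 or 307 or 308;
        if (isRedirect && headers.TryGetValue("Location", out var location))
            return new Uri(requestUri, location); // also resolves relative Location values
        return null;
    }
}
```

With something like this, a caller could check `RawHttp.GetRedirectTarget(uri, response)` after each request and, when it is not null, open a new connection to the returned URI's host and repeat the request, ideally with a cap on the number of hops so a redirect loop cannot run forever.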

IOEnthusiast
  • So the real question is what the HTTP status codes mean? And why different sites return different things to what they perceive to be screen scrapers? Because some sites hate screen scrapers. No, 301 doesn't mean there's no www subdomain, it means [the domain has changed](https://developer.mozilla.org/en-US/docs/Web/HTTP/Status/301). No, 302 isn't an unsupported browser, it's another redirect, eg to the authentication page – Panagiotis Kanavos Mar 21 '23 at 09:01
  • I thought it was an issue with the tcp configuration, when I asked the question. – IOEnthusiast Mar 21 '23 at 09:04
  • 1
    You can find the various HTTP status codes in [Mozilla's Http Response Status Codes](https://developer.mozilla.org/en-US/docs/Web/HTTP/Status) docs – Panagiotis Kanavos Mar 21 '23 at 09:05
  • TCP doesn't return status codes. This answer is simply wrong. – Panagiotis Kanavos Mar 21 '23 at 09:08