
I am trying to scrape some information from a website for my organization. The information sits behind a login page, which for now I am bypassing by logging in through the browser with my organization credentials. The website stores the login details in cookies, so on any subsequent visit I am still logged in. (I know this is a hit-and-miss solution, but for my purposes it's fine. If I am logged out, I will just log back in manually through a browser session.)

Within this site there are two sections I need to access:

  • /Memberships

    In order to retrieve a list of URLs

  • /Organisation?orgid=XXXXXX

    And the individual organization pages, whose URLs are retrieved from the /Memberships page

Problem

For some strange reason, the call to /Memberships works: the HTML retrieved is perfectly fine and I am able to get a list of all the child URLs.

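using System.Net;
using System.Net.Http;
using System.Threading.Tasks;

string url = "https://www.ACME.com/Memberships";
// Blocks the calling thread until the request completes.
var response = CallUrl(url).Result;

private static async Task<string> CallUrl(string fullUrl)
{
    // Set the TLS version before the client is created.
    ServicePointManager.SecurityProtocol = SecurityProtocolType.Tls13;
    HttpClient client = new HttpClient();
    client.DefaultRequestHeaders.Accept.Clear();

    // Fetch the response body as a string.
    return await client.GetStringAsync(fullUrl);
}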

When I then query any of the child URLs, I don't get the HTML response I am expecting, which would be the organization details. Instead, I am presented with the website's login page (well, the HTML of the login page).

The code used is the same as above, except that the url variable is swapped for:

string url = "https://www.ACME.com/Organisation?orgid=XXXX";

Keep in mind that one must be logged in to access both the /Memberships page and the individual /Organisation?orgid=XXXXXX pages.

So what's stumping me is: why can I access /Memberships but not the other pages?
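
One way to narrow this down is to look at the final URL and status code of each request rather than only the body. The following is a minimal sketch, assuming the site redirects unauthenticated requests to a page whose path contains "login". The `RedirectCheck` class, the `IsLoginRedirect` helper, and that path check are hypothetical and not taken from the real site.

```csharp
using System;
using System.Net.Http;
using System.Threading.Tasks;

public static class RedirectCheck
{
    // Hypothetical heuristic: the request ended on a different path that mentions "login".
    public static bool IsLoginRedirect(Uri requested, Uri final) =>
        !string.Equals(requested.AbsolutePath, final.AbsolutePath, StringComparison.OrdinalIgnoreCase)
        && final.AbsolutePath.IndexOf("login", StringComparison.OrdinalIgnoreCase) >= 0;

    public static async Task<bool> WasSentToLoginAsync(HttpClient client, Uri requested)
    {
        using var response = await client.GetAsync(requested);
        // After automatic redirects, RequestMessage.RequestUri holds the final URL.
        Uri final = response.RequestMessage?.RequestUri ?? requested;
        Console.WriteLine($"{(int)response.StatusCode} {final}");
        return IsLoginRedirect(requested, final);
    }
}
```

Running `WasSentToLoginAsync` against both /Memberships and one /Organisation?orgid=XXXX URL would show whether only the second request is being redirected.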

maisyk
  • Why do you think logging in with a browser would also log in HttpClient? Have you done anything to have them actually share an authentication cookie or something along those lines? – mason Nov 23 '21 at 21:08
  • HttpClient is not linked to your browser and does not share its cookies. /Memberships apparently does not require authentication, but you need to log in programmatically and [save the cookies in your HttpClient's handler](https://stackoverflow.com/questions/17983992/httpclient-not-saving-cookies) in order to access the other pages. – CodeCaster Nov 23 '21 at 21:08
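
What the comments describe can be sketched roughly as follows, assuming the site uses a plain form-based login. The `/Account/Login` path, the `username` and `password` field names, and the `ScraperSketch` class are hypothetical placeholders, not taken from the real site, and many sites also require an anti-forgery token that this sketch does not handle. The idea is to give `HttpClient` its own `CookieContainer`, log in once, and reuse the same client for every later request.

```csharp
using System;
using System.Collections.Generic;
using System.Net;
using System.Net.Http;
using System.Threading.Tasks;

public static class ScraperSketch
{
    // Build one client whose handler stores cookies across requests.
    public static HttpClient CreateClient(CookieContainer cookies)
    {
        var handler = new HttpClientHandler
        {
            CookieContainer = cookies,
            UseCookies = true
        };
        return new HttpClient(handler);
    }

    // Hypothetical login: the path and form field names are assumptions.
    public static async Task LoginAsync(HttpClient client, Uri baseUri, string user, string password)
    {
        var form = new FormUrlEncodedContent(new Dictionary<string, string>
        {
            ["username"] = user,
            ["password"] = password
        });
        using var response = await client.PostAsync(new Uri(baseUri, "/Account/Login"), form);
        response.EnsureSuccessStatusCode();
    }

    public static async Task RunAsync()
    {
        var baseUri = new Uri("https://www.ACME.com");
        var cookies = new CookieContainer();
        using var client = CreateClient(cookies);

        // Any cookies set by the login response are kept in the container.
        await LoginAsync(client, baseUri, "my-user", "my-password");

        // The same client sends those cookies with this request.
        string html = await client.GetStringAsync(new Uri(baseUri, "/Organisation?orgid=XXXX"));
        Console.WriteLine(html.Length);
    }
}
```

If logging in through the browser is still preferred, the session cookie could instead be copied by hand into the same container with `cookies.Add(baseUri, new Cookie(name, value))`, where the name and value are read from the browser's developer tools.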
