
My code below uses C# and HtmlAgilityPack to scrape a webpage, then uses WebClient to download a string from another webpage. This works great on localhost, but when I publish my code as an API service on Azure or execute it on a web hosting service (e.g. HostGator), I always receive a 403 Forbidden error. I've tried so many ways to get this to work and cannot for the life of me figure this out. Any help would be greatly appreciated.

// Load the viewer page and locate the element whose text contains "manifestId:".
HtmlWeb web = new HtmlWeb();
HtmlDocument doc = web.Load("https://antenati.cultura.gov.it/ark:/12657/an_ud18290200");
//string returnedResult = doc.DocumentNode.OuterHtml; //this shows a 403 Forbidden error response when not running from localhost.
string ress = doc.DocumentNode.SelectSingleNode("//*[text()[contains(., 'manifestId:')]]").InnerText;

if (!string.IsNullOrEmpty(ress))
{
    string[] strPieces = ress.Split(new string[] { "manifestId:" }, StringSplitOptions.None);
    if (strPieces.Length >= 2)
    {
        // Extract the manifest URL and request it with browser-like headers.
        WebClient wb = new WebClient();
        string manifestUrl = strPieces[1].Split(',')[0].Replace("'", "").Trim();
        wb.Headers.Add("origin", "https://antenati.cultura.gov.it");
        wb.Headers.Add("referer", "https://antenati.cultura.gov.it/");
        wb.Headers.Add("user-agent", "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/110.0.0.0 Safari/537.36");

        string result = wb.DownloadString(manifestUrl);
    }
}

Code I have tried that results in a 403 error on https://dotnetfiddle.net:

using System;
using System.IO;
using System.Net;
                    
public class Program
{
    public static void Main()
    {
        string text = "";
        HttpWebRequest request = (HttpWebRequest)WebRequest.Create("https://antenati.cultura.gov.it/ark:/12657/an_ud18290200");
        
        //request.Proxy = new WebProxy("173.192.21.89", 80);

        request.UserAgent = "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:98.0) Gecko/20100101 Firefox/98.0";
        request.Accept = "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,*/*;q=0.8";
        //request.Connection = "keep-alive";
        request.Headers.Add("Accept-Language", "en-US,en;q=0.5");
        //request.Headers.Add("Accept-Encoding", "gzip, deflate");
        request.Headers.Add("Upgrade-Insecure-Requests", "1");
        request.Headers.Add("Sec-Fetch-Dest", "document");
        request.Headers.Add("Sec-Fetch-Mode", "navigate");
        request.Headers.Add("Sec-Fetch-Site", "none");
        request.Headers.Add("Sec-Fetch-User", "?1");
        request.Headers.Add("Cache-Control", "max-age=0");      

        // Get the response and read the body.
        using (WebResponse response = request.GetResponse())
        using (var sr = new StreamReader(response.GetResponseStream()))
        {
            text = sr.ReadToEnd();
        }

        Console.WriteLine(text);
    }
}
Jakal
2 Answers


There are two main causes for a 403 error:

  1. The URL you are trying to scrape is forbidden, and you need to be authorized to access it.
  2. The website detects that you are a scraper and returns a 403 Forbidden HTTP status code as a ban page.

These errors are common when scraping websites protected by Cloudflare, which returns a 403 status code when it blocks a request.

There are a few simple solutions for this:

  1. Using Fake User Agents
  2. Optimizing Request Headers
  3. Using Proxies

Solutions:

  1. Configure your scraper to send a fake user-agent with every request. Here's an example: {'User-Agent': 'Mozilla/5.0 (iPad; CPU OS 12_2 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Mobile/15E148'}. When scraping at scale, you need to maintain a large list of user-agents and pick a different one for each request.
  2. If the website has a more sophisticated anti-bot detection system in place, you will also need to optimize the request headers. Instead of a simple fake user-agent like the one above, use a more complete header set as shown here:
    {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:98.0) Gecko/20100101 Firefox/98.0",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.5",
    "Accept-Encoding": "gzip, deflate",
    "Connection": "keep-alive",
    "Upgrade-Insecure-Requests": "1",
    "Sec-Fetch-Dest": "document",
    "Sec-Fetch-Mode": "navigate",
    "Sec-Fetch-Site": "none",
    "Sec-Fetch-User": "?1",
    "Cache-Control": "max-age=0",
    }
  3. Finally, you can use a list of proxies and rotate through each of them (a C# sketch combining all three mitigations follows this list). Here's an example:
    [
        'http://Username:Password@IP1:20000',
        'http://Username:Password@IP2:20000',
        'http://Username:Password@IP3:20000',
        'http://Username:Password@IP4:20000'
    ]
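
For reference, here is a minimal C# sketch that combines the three mitigations above using HttpClient. The proxy address, credentials, and user-agent pool are placeholders, and there is no guarantee this will get past a given site's bot detection:

using System;
using System.Net;
using System.Net.Http;
using System.Threading.Tasks;

public class ScraperSketch
{
    // Placeholder pool of user-agents; pick a different one for each request.
    private static readonly string[] UserAgents =
    {
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:98.0) Gecko/20100101 Firefox/98.0",
        "Mozilla/5.0 (iPad; CPU OS 12_2 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Mobile/15E148"
    };

    private static readonly Random Rng = new Random();

    public static async Task<string> FetchAsync(string url)
    {
        // Placeholder proxy endpoint and credentials; rotate through a real list.
        var handler = new HttpClientHandler
        {
            Proxy = new WebProxy("http://IP1:20000")
            {
                Credentials = new NetworkCredential("Username", "Password")
            },
            UseProxy = true,
            AutomaticDecompression = DecompressionMethods.GZip | DecompressionMethods.Deflate
        };

        using (var client = new HttpClient(handler))
        {
            // Send a random user-agent plus the optimized header set shown above.
            var headers = client.DefaultRequestHeaders;
            headers.TryAddWithoutValidation("User-Agent", UserAgents[Rng.Next(UserAgents.Length)]);
            headers.TryAddWithoutValidation("Accept", "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,*/*;q=0.8");
            headers.TryAddWithoutValidation("Accept-Language", "en-US,en;q=0.5");
            headers.TryAddWithoutValidation("Upgrade-Insecure-Requests", "1");
            headers.TryAddWithoutValidation("Sec-Fetch-Dest", "document");
            headers.TryAddWithoutValidation("Sec-Fetch-Mode", "navigate");
            headers.TryAddWithoutValidation("Sec-Fetch-Site", "none");
            headers.TryAddWithoutValidation("Sec-Fetch-User", "?1");

            return await client.GetStringAsync(url);
        }
    }
}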
  • Thank you for the insight Rajtilak, it is much appreciated. I tried adding the headers you suggested and even played around with trying to add some proxies, but I still keep getting the 403 error. I still don't understand why I can run the same code on localhost and have no issues. I updated my original post to show the modified code I tried to run with no success at dotnetfiddle.net – Jakal May 30 '23 at 01:54

If I understand correctly, you want to download the content of the viewer. In this case the viewer is Mirador (https://github.com/ProjectMirador). According to your page, this is the code of the viewer:

$(function() {
Mirador.viewer({
  language: 'it',          
  id: "mirador",
  window: { 
    allowClose: false,
    sideBarOpenByDefault: false
  },          
  windows: [{
    imageToolsEnabled: true,
    imageToolsOpen: true,            
    manifestId: 'https://dam-antenati.cultura.gov.it/antenati/containers/57Q3a8X/manifest',
    canvasId: 'https://antenati.cultura.gov.it/ark:/12657/an_ua18290105/5gn8Rvz'
  }],
  workspaceControlPanel: { enabled: false },
  workspace: {
    showZoomControls: true,
    type: 'mosaic'
  },  
  annotation: {
    adapter: (canvasId) => new Mirador.LocalStorageAdapter(`localStorage://?canvasId=${canvasId}`),
    exportLocalStorageAnnotations: false,
  }
}, [
  ...Mirador.miradorCanvasUpdatePlugin,
  ...Mirador.miradorViewerInfoPlugin,
  ...Mirador.miradorImageToolsPlugin,
  ...Mirador.annotationPlugins
]);

});

manifestId contains the URL you want to download. If you follow that link directly you will get a 403; obviously you do not have access. So, one way to get at this file is through a web browser: the WebBrowser control, available in the 4.7 Framework, can navigate to your URL. You can then modify the Mirador configuration to add the download plugin (https://github.com/ProjectMirador/mirador-dl-plugin); in this way, I think you can invoke the click event and download the file. A rough sketch follows. Hope this helps.
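
Here is a minimal, untested sketch of that WebBrowser approach (WinForms, .NET Framework); the event-driven structure and the DocumentText read are just one way to get at the rendered page:

using System;
using System.Windows.Forms;

public class ViewerPage
{
    [STAThread]
    public static void Main()
    {
        // WebBrowser needs an STA thread and a running message loop.
        var browser = new WebBrowser { ScriptErrorsSuppressed = true };

        browser.DocumentCompleted += (sender, e) =>
        {
            // The page has now rendered inside a browser context; from here you
            // could read the DOM or drive the mirador-dl-plugin's download button
            // via InvokeMember on the relevant element.
            Console.WriteLine(browser.DocumentText);
            Application.ExitThread();
        };

        browser.Navigate("https://antenati.cultura.gov.it/ark:/12657/an_ud18290200");
        Application.Run(); // pump messages until DocumentCompleted fires
    }
}

Note that the WebBrowser control uses the legacy Internet Explorer engine, so a modern JavaScript-heavy viewer like Mirador may not render correctly in it.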