
I'm building a small scraper that navigates through a set of URLs.

Currently I have something like:

public class MyScraper : WebScraper
{
    private Queue<string> _urlToParse = new Queue<string>();

    public override void Init()
    {
        // Initializing _urlToParse with more than 1000 URLs
        Request(_urlToParse.Dequeue(), Parse);
    }

    public override void Parse(Response response)
    {
        if (response.WasSuccessful)
        {
            // ...Parsing
        }
        else
        {
            // Logging the error
        }

        Request(_urlToParse.Dequeue(), Parse);
    }
}

But the Parse method isn't called when I receive a 404 error.

Consequences:

  1. I cannot log the error (and once the first Request call returns, I have no way of knowing whether it was successful).
  2. The next URL is not parsed.

I was expecting to land in the Parse method with response.WasSuccessful = false and then be able to check the status code, roughly as sketched below.
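For illustration, this is the flow I had in mind (the status-code check is hypothetical; I don't know which property IronWebScraper actually exposes for it):

public override void Parse(Response response)
{
    if (response.WasSuccessful)
    {
        // ...Parsing
    }
    else
    {
        // Hypothetical: inspect the status code here (exact property name unknown)
        // and log the 404 before moving on
    }

    // Continue with the next URL either way
    if (_urlToParse.Count > 0)
    {
        Request(_urlToParse.Dequeue(), Parse);
    }
}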

How should I handle this 404?

J4N

1 Answer


The only way I could find to log the failed URL is to override the Log(string Message, LogLevel Type) method. There doesn't appear to be much use for response.WasSuccessful here; as you said, Parse() only seems to be called when the request is successful.

public class MyScraper : WebScraper
{
    private Queue<string> _urlToParse = new Queue<string>();

    public override void Init()
    {
        _urlToParse.Enqueue("https://stackoverflow.com/");
        _urlToParse.Enqueue("https://stackoverflow.com/nothing");
        _urlToParse.Enqueue("https://google.com/");

        Request(_urlToParse.Dequeue(), Parse);
    }

    public override void Parse(Response response)
    {
        Console.WriteLine("Handling response");

        if (_urlToParse.Count > 0)
        {
            Request(_urlToParse.Dequeue(), Parse);
        }
    }

    public override void Log(string Message, LogLevel Type)
    {
        if (Type.HasFlag(LogLevel.Critical) && Message.StartsWith("Url failed permanently"))
        {
            Console.WriteLine($"Logging failed Url: {Message}");

            // The failed URL never reaches Parse, so keep the queue moving from here
            if (_urlToParse.Count > 0)
            {
                Request(_urlToParse.Dequeue(), Parse);
            }
        }
    }
}

Another option: WebScraper appears to have a MaxHttpConnectionLimit property that you can set to make sure it only opens one connection at a time.

public class MyScraper : WebScraper
{
    public override void Init()
    {
        // Limit the scraper to a single open connection at a time
        MaxHttpConnectionLimit = 1;

        var urls = new string[]
        {
            "https://stackoverflow.com/",
            "https://stackoverflow.com/nothing",
            "https://google.com/"
        };

        Request(urls, Parse);
    }

    public override void Parse(Response response)
    {
        Console.WriteLine("Handling response");
    }

    public override void Log(string Message, LogLevel Type)
    {
        if (Type.HasFlag(LogLevel.Critical) && Message.StartsWith("Url failed permanently"))
        {
            Console.WriteLine($"Logging failed Url: {Message}");
        }

        base.Log(Message, Type);
    }
}
David Specht
  • The problem is that the call to "Request" is asynchronous, so I cannot simply wait on its return to avoid making too many requests in parallel. – J4N Jan 29 '20 at 06:29
  • What do you mean you cannot wait on the return of an async call? If you're using async, just use the await keyword to wait for the response before continuing to the next one. Please see https://learn.microsoft.com/en-us/dotnet/csharp/programming-guide/concepts/async/#dont-block-await-instead for more information – Scircia Jan 29 '20 at 06:58
  • @Scircia I think it really starts to Request/Parse when you exit the `Init` method. `Request(...)` is not `async/await`, right? – J4N Jan 29 '20 at 07:30
  • I missed the need to throttle the throughput. I made a modification that will handle that. I just don't like that it depends on the log message to know if the Url failed. – David Specht Jan 29 '20 at 13:31
  • I added another option that I believe is how `IronWebScraper` intended for users to manage throughput. – David Specht Jan 29 '20 at 13:55