The web scraper from the library works with HTML nodes. It's hard to explain, but I scrape a tag and then its contents, and I want to handle the contents as an array, which they already are by default in this library. The issue is that I can iterate over the array with a for loop like any other array, but for some reason I cannot access it by index.

This is my code, using exactly the same website link that the library's documentation uses:

In the main:

    static void Main(string[] args) {
        var scraper = new HelloScraper();
        scraper.Start();
    }

Then Init:

    public override void Init() {
        this.LoggingLevel = WebScraper.LogLevel.None;
        this.Request("https://1337x.to/sort-search/Aquaman/time/desc/1/", Parse);
    }

And now the Parse method, which gives me trouble. I will split it to show what works and what doesn't. This works:

    public override void Parse(Response response) {
        foreach (var torrentLink in response.Css("tr")) {
            HtmlNode[] torrentContents = torrentLink.Css("td");
            for (int i = 0; i < torrentContents.Length; i++) {
                Console.WriteLine($"{i}: {torrentContents[i].InnerText}");
            }
            Console.WriteLine();
        }
    }

To make it easier to understand, I will talk about a single "torrent" here. This working piece of code produces:

0: Aquaman IMAX (2019) AC3 5.1 ITA.ENG 1080p H265 sub NUita.eng Sp33dy94 MIRCrew1
1: 7
2: 0
3: 8pm Oct. 2nd
4: 4.2 GB7
5: Sp33dy94

But the following piece of code selects what I need from the same array, using the same indexes that I can see working in the for loop:

    public override void Parse(Response response) {
        foreach (var torrentLink in response.Css("tr")) {
            HtmlNode[] torrentContents = torrentLink.Css("td");
            string torrentName = torrentContents[0].InnerText;
            string torrentSeeds = torrentContents[1].InnerText;
            string torrentSize = torrentContents[4].InnerText;
            Console.WriteLine($"{torrentName} --> [Size:{torrentSize} | Seeds:{torrentSeeds}]");
            Console.WriteLine();
        }
    }

This produces nothing. The console doesn't display an error, and when I tried to debug it, it looked as though accessing the array by index "points to a null reference".

Maybe I am missing something, but if an array can be accessed by index inside a for loop, it should be accessible by index outside of it too. Am I wrong? What is the issue here?

By the way, I don't know whether 1337x.to allows web scraping or not, but I do not intend to use this commercially or for myself; it is just a website I chose to practice with.

David Shnayder

1 Answer

After many hours of messing around in the debugger, I got it. The first tr is the table's header row, which has no td cells inside, so Css("td") returns an empty array for it. The for loop hides this, because its body simply never runs when the array's length is 0. Accessing torrentContents[0] on that empty array, however, fails on the very first row, so nothing after it is printed. Adding a simple if statement that checks the length before indexing fixes the issue:

    public override void Parse(Response response) {
        foreach (var torrentLink in response.Css("tr")) {
            HtmlNode[] torrentContents = torrentLink.Css("td");
            // Skip rows without enough td cells (such as the header row);
            // the highest index used below is 4.
            if (torrentContents.Length > 4) {
                string torrentName = torrentContents[0].InnerText;
                string torrentSeeds = torrentContents[1].InnerText;
                string torrentSize = torrentContents[4].InnerText;
                Console.WriteLine($"{torrentName} --> [Size:{torrentSize} | Seeds:{torrentSeeds}]");
                Console.WriteLine();
            }
        }
    }
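
The difference between looping and indexing can be reproduced without the scraper: a for loop over an empty array runs zero times, while indexing into it directly fails. The following is only a minimal sketch, assuming plain string arrays stand in for the HtmlNode[] results; the class EmptyRowDemo, the helper TryFormatRow, and the sample rows are hypothetical and made up for illustration:

```csharp
using System;

public static class EmptyRowDemo
{
    // Hypothetical helper: builds the summary line only when the row has
    // all the cells we index into (0, 1 and 4); otherwise reports failure.
    public static bool TryFormatRow(string[] cells, out string line)
    {
        if (cells.Length < 5)
        {
            line = string.Empty;
            return false;
        }
        line = $"{cells[0]} --> [Size:{cells[4]} | Seeds:{cells[1]}]";
        return true;
    }

    public static void Main()
    {
        // Made-up sample data: an empty "header" row followed by one data row.
        string[][] rows =
        {
            new string[0],
            new[] { "Aquaman", "7", "0", "8pm Oct. 2nd", "4.2 GB", "Sp33dy94" },
        };

        foreach (string[] cells in rows)
        {
            // For an empty array the loop condition is false immediately,
            // so the body never runs and no index is ever touched.
            for (int i = 0; i < cells.Length; i++)
            {
                Console.WriteLine($"{i}: {cells[i]}");
            }

            // Direct indexing has no such protection, so guard it explicitly.
            if (TryFormatRow(cells, out string line))
            {
                Console.WriteLine(line);
            }
            else
            {
                Console.WriteLine($"Skipped a row with {cells.Length} cells.");
            }
        }
    }
}
```

Replacing the guarded call with a bare cells[0] would throw an IndexOutOfRangeException on the first row, and the second row would never be reached.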
David Shnayder