I'm an IT student and we need to develop a C# program that gets all the information from a website and then uses NoSQL to add the information to an Oracle database. I've got a few questions and would really appreciate some help.

We decided to use the Autotrader (http://www.autotrader.co.za/) website and MongoDB for NoSQL.

So far I'm using the following code to write information from the website to a text file, but the problem is that it only gets information from the current page, and not the entire website.

// Download the raw HTML of a single page.
HttpWebRequest req = (HttpWebRequest)WebRequest.Create(url);
using (HttpWebResponse resp = (HttpWebResponse)req.GetResponse())
using (StreamReader sr = new StreamReader(resp.GetResponseStream()))
{
    string sourceCode = sr.ReadToEnd();
    return sourceCode;
}

I would like to know how I can follow all the links and get all the information from the website (not just the current page), without going to any other websites.

Secondly, once I have all the information from the website, how should I go about retrieving the specific information I need for the Oracle database with MongoDB, etc.?

Smiel

2 Answers


I can help with the first part of your question. You can use HtmlAgilityPack to find all the links in the web page you just scraped. You can read about how to do it in this question: HTML Agility Pack get all anchors' href attributes on page.

Basically, what you need to do is initialize an HtmlDocument from the response stream, then do:

   var nodes = _htmlDocument.DocumentNode.SelectNodes("//a[@href]");
   if (nodes != null)
   {
       var links = nodes.Select(a => a.Attributes["href"].Value)
                        .Distinct();
   }

Once you have your list of URLs, you can recursively call the scraper function for each of them.

Here is more information about HtmlAgilityPack: http://www.codeproject.com/Articles/659019/Scraping-HTML-DOM-elements-using-HtmlAgilityPack-H
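Putting the pieces together, here is a rough sketch of what such a crawler could look like. It uses HtmlAgilityPack's HtmlWeb to load each page, keeps a visited set so pages aren't scraped twice, and only follows links whose host matches the start URL (so you never leave the website). The class and variable names are just illustrative, not from any library; replace the marked comment with your own data extraction.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack;

class SiteCrawler
{
    // Pages we have already scraped, so we don't loop forever.
    private readonly HashSet<string> _visited = new HashSet<string>();

    public void Crawl(Uri startUri)
    {
        var pending = new Queue<Uri>();
        pending.Enqueue(startUri);

        while (pending.Count > 0)
        {
            Uri current = pending.Dequeue();
            if (!_visited.Add(current.AbsoluteUri))
                continue; // already scraped this page

            HtmlDocument doc = new HtmlWeb().Load(current.AbsoluteUri);
            // ... extract the information you need from doc here ...

            var nodes = doc.DocumentNode.SelectNodes("//a[@href]");
            if (nodes == null)
                continue;

            foreach (var node in nodes)
            {
                string href = node.GetAttributeValue("href", "");
                // Resolve relative links against the current page and
                // stay on the same website.
                Uri link;
                if (Uri.TryCreate(current, href, out link)
                    && link.Host == startUri.Host)
                {
                    pending.Enqueue(link);
                }
            }
        }
    }
}
```

A queue gives you a breadth-first crawl, which avoids the deep recursion (and possible stack overflow) you'd get by calling the scraper recursively on a large site.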

yohannist

I would use an existing web automation library such as Selenium WebDriver (there's a good example here: http://seleniumdotnet.blogspot.co.uk/) to drive the web page - these libraries let you query and drive the page like a user would.
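For illustration, here is a minimal sketch of that approach, assuming the Selenium.WebDriver NuGet package plus a browser driver (ChromeDriver here) are installed; the class name and selector are just examples:

```csharp
using System;
using OpenQA.Selenium;
using OpenQA.Selenium.Chrome;

class SeleniumSketch
{
    static void Main()
    {
        // Opens a real browser and navigates like a user would.
        using (IWebDriver driver = new ChromeDriver())
        {
            driver.Navigate().GoToUrl("http://www.autotrader.co.za/");

            // Query the rendered page, e.g. list every link on it.
            foreach (IWebElement anchor in
                     driver.FindElements(By.CssSelector("a[href]")))
            {
                Console.WriteLine(anchor.GetAttribute("href"));
            }
        }
    }
}
```

Because Selenium drives a real browser, it also sees content that JavaScript adds after the page loads, which a raw HttpWebRequest scrape would miss.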

PhillipH