scraping data from website with a C# console application

Question

I'm trying to learn Spanish and am making some flash cards (for my personal use) to help me learn the verbs.

Here is an example page. Near the top of the page you will see the past participle (bloqueado) and the gerund (bloqueando). These are the two values I want to obtain in my code and use for my flash cards.

If this is possible, I will use a C# console application. I am aware that scraping data from a website is not ideal; however, this is a one-off.

Any guidance on how to start something like this and pitfalls to avoid would be very helpful!

So what have you tried, and where exactly are you stuck? Got any code yet that you might show? — bassfader, Apr 06 '17 at 10:43
Well, I tried looking at the HTML of the webpage to see if I could parse it; however, I couldn't see the fields I required in the HTML. Since then I have been reading about other ways to scrape the data, but without using a third-party application. — mHelpMe, Apr 06 '17 at 10:45
What do you mean by *"however I couldn't see the fields I required in the html"*? What fields do you mean? When looking at the HTML using the Chrome Developer Tools I easily found these values / words, they are all listed within the following section tag: `<section class="verb-mood-section">`. To me it is still very unclear what exactly you are having problems with... — bassfader, Apr 06 '17 at 10:54
Ah, thanks! I can now see them in the HTML. I didn't think the values were in the HTML, so I was wondering how to scrape the data from the website. Now that I can see the values, I will google how to get the HTML of a webpage. — mHelpMe, Apr 06 '17 at 11:18

score 0 · Answer 1 · answered Apr 06 '17 at 12:54

I know this isn't an exact answer, but here is the process I would suggest.

1. Download Wget (https://www.gnu.org/software/wget/) and mirror the website to a folder. Wget is a web spider and will follow the links on the site until it has downloaded everything. You may have to experiment with a few different parameters before you find the settings you want.
2. Use C# to run through each file in the folder and extract the words from <section class="verb-mood-section"> in each file. It's up to you whether you output them to the console or store them in a database or flat file.

Should be that easy, in theory.
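
The second step of that process might look roughly like the sketch below. It is only an illustration under a few assumptions that are not confirmed by the page itself: the mirrored files end in .html, the section's class attribute is exactly "verb-mood-section", and the section contains no nested <section> elements (so a regular expression is good enough). The folder name "mirror" and the class name VerbSectionExtractor are made up for the example.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text.RegularExpressions;

static class VerbSectionExtractor
{
    // Assumption: no nested <section> elements, so a non-greedy match
    // up to the first </section> captures the whole block.
    static readonly Regex SectionPattern = new Regex(
        "<section[^>]*class=\"verb-mood-section\"[^>]*>(.*?)</section>",
        RegexOptions.Singleline | RegexOptions.IgnoreCase);

    static readonly Regex TagPattern = new Regex("<[^>]+>");

    // Returns the text of each matching section with the tags stripped
    // and runs of whitespace collapsed to single spaces.
    public static List<string> ExtractSections(string html)
    {
        var results = new List<string>();
        foreach (Match match in SectionPattern.Matches(html))
        {
            string text = TagPattern.Replace(match.Groups[1].Value, " ");
            text = Regex.Replace(text, @"\s+", " ").Trim();
            results.Add(text);
        }
        return results;
    }

    static void Main(string[] args)
    {
        // Hypothetical default: the folder that wget mirrored the site into.
        string folder = args.Length > 0 ? args[0] : "mirror";
        if (!Directory.Exists(folder))
        {
            Console.Error.WriteLine("Folder not found: " + folder);
            return;
        }

        foreach (string path in Directory.EnumerateFiles(
                     folder, "*.html", SearchOption.AllDirectories))
        {
            foreach (string section in ExtractSections(File.ReadAllText(path)))
            {
                Console.WriteLine(Path.GetFileName(path) + ": " + section);
            }
        }
    }
}
```

Regular expressions are fragile against real-world HTML, so if the markup turns out to be messier than assumed here, a proper HTML parser is the safer choice.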

score 0 · Answer 2 · answered Jan 14 '19 at 13:11

Use SGMLReader. SGMLReader is a versatile and robust component that exposes HTML as an XmlReader, so the page can be loaded like an XML document:

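// Requires: using System.IO; using System.Xml; and a reference to the SgmlReader library.
XmlDocument FromHtml(TextReader reader) {

    // Set up SgmlReader to parse the input as HTML
    Sgml.SgmlReader sgmlReader = new Sgml.SgmlReader();
    sgmlReader.DocType = "HTML";
    sgmlReader.WhitespaceHandling = WhitespaceHandling.All;
    sgmlReader.CaseFolding = Sgml.CaseFolding.ToLower;  // normalise tag names to lower case
    sgmlReader.InputStream = reader;

    // Create the document and load it through the SgmlReader
    XmlDocument doc = new XmlDocument();
    doc.PreserveWhitespace = true;
    doc.XmlResolver = null;  // do not resolve external resources such as DTDs
    doc.Load(sgmlReader);
    return doc;
}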

You can see that you need to create a TextReader first. In practice this would be a StreamReader (or a StringReader), because TextReader is an abstract class.
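
As a rough illustration of that point, the sketch below shows two concrete TextReader types from System.IO that could be handed to FromHtml: a StreamReader over a saved file, and a StringReader over HTML that is already in memory. The class and method names (ReaderExamples, FromFile, FromString) are hypothetical, and the UTF-8 encoding is an assumption about the saved page.

```csharp
using System;
using System.IO;
using System.Text;

static class ReaderExamples
{
    // For a page saved to disk: StreamReader derives from TextReader.
    // Assumption: the file is UTF-8 encoded.
    public static TextReader FromFile(string path)
    {
        return new StreamReader(path, Encoding.UTF8);
    }

    // For HTML already held in a string: StringReader also derives from TextReader.
    public static TextReader FromString(string html)
    {
        return new StringReader(html);
    }

    static void Main()
    {
        using (TextReader reader = FromString("<html><body>hola</body></html>"))
        {
            // In the answer's code, this reader would be passed to FromHtml(reader).
            // Here the content is simply echoed to show the reader works.
            Console.WriteLine(reader.ReadToEnd());
        }
    }
}
```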

Then you create the XmlDocument over that. Once the page is loaded into the XmlDocument, you can use the methods it supports (XPath queries, for example) to isolate and extract the nodes you need. I'll leave you to explore that aspect of it.
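
One way to start exploring is with an XPath query via SelectNodes. The sketch below runs against a small hand-written sample rather than the real page: the markup (bold words inside a section with class "verb-mood-section") is an assumption made for illustration, so the XPath expression will need adjusting to whatever the actual document contains. The names XmlDocumentExample and ExtractBoldWords are hypothetical.

```csharp
using System;
using System.Xml;

static class XmlDocumentExample
{
    // Hypothetical, well-formed stand-in for what the loaded page might contain.
    const string Sample =
        "<html><body><section class=\"verb-mood-section\">" +
        "<p>Past participle: <b>bloqueado</b></p>" +
        "<p>Gerund: <b>bloqueando</b></p>" +
        "</section></body></html>";

    // Collects the text of every <b> element inside the target section.
    public static string[] ExtractBoldWords(XmlDocument doc)
    {
        XmlNodeList nodes = doc.SelectNodes(
            "//section[@class='verb-mood-section']//b");
        var words = new string[nodes.Count];
        for (int i = 0; i < nodes.Count; i++)
        {
            words[i] = nodes[i].InnerText.Trim();
        }
        return words;
    }

    static void Main()
    {
        var doc = new XmlDocument();
        doc.LoadXml(Sample);  // with a real page, use the document returned by FromHtml
        foreach (string word in ExtractBoldWords(doc))
        {
            Console.WriteLine(word);
        }
    }
}
```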

You might try using the XDocument class instead, as it's a lot easier to handle than XmlDocument, especially if you're a newbie. It also supports LINQ.
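
For comparison, here is a minimal LINQ to XML sketch. Again the sample markup is hypothetical and only stands in for the real page, and the names XDocumentExample and ExtractBoldWords are made up. XDocument.Load has an overload that accepts an XmlReader, so in principle the SgmlReader could be passed to it directly, though that has not been tried here.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

static class XDocumentExample
{
    // Hypothetical, well-formed stand-in for what the loaded page might contain.
    const string Sample =
        "<html><body><section class=\"verb-mood-section\">" +
        "<p>Past participle: <b>bloqueado</b></p>" +
        "<p>Gerund: <b>bloqueando</b></p>" +
        "</section></body></html>";

    // Finds the target section(s), then collects the text of each <b> inside.
    public static List<string> ExtractBoldWords(XDocument doc)
    {
        return doc.Descendants("section")
                  .Where(s => (string)s.Attribute("class") == "verb-mood-section")
                  .Descendants("b")
                  .Select(b => b.Value.Trim())
                  .ToList();
    }

    static void Main()
    {
        XDocument doc = XDocument.Parse(Sample);
        foreach (string word in ExtractBoldWords(doc))
        {
            Console.WriteLine(word);
        }
    }
}
```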
