6

Okay, to keep it short:

I have found several websites with tables containing information that I would like to have "local" to query.

I have searched for possibilities, and I have some ideas myself.

  1. Excel has a function that lets me navigate to a web page and copy the data from a table. The problem is that this only happens once. The data in the tables is updated every week, so I need Excel to refresh automatically every time I open my program.

  2. I could use a crawler, but then I would have to write a different solution for every table and find a way to save the data.

I already have a MySQL database containing much of the information I need in my program, so a solution that requires a database is totally acceptable.

About my program: it will be written in C#, first as a local application and later as an MVC project. Suggestions for both are very welcome, and if you need more information just comment and I will try to describe it some more. :)

EDIT 1

I am very sorry that I didn't tell you from the beginning which tables I was talking about, but when I posted this question I still needed to find all of them. Now, however, I have picked a few of them out to show you how different the tables I have to work with are. About the project: the program I plan to make is only for private use, not for sale. I am not sure about the rules on crawling public sites, so that's why I keep it private.

Table 2 and Table 3 (links to example tables)

As you can see, it's a lot of soccer data presented in very different ways, so I need to know which way is best for me to gather the data, because I believe it will be easier to design my database with this knowledge.

Anders Gerner
  • I posted a preliminary answer, but I am wondering: are these sites you have control of, or public sites you do not own? Also, can you elaborate on the database you have that contains much of the info you need; where did this data come from? Finally, is Excel necessary here, or just mentioned because it sort of does what you need, but manually, whereas you would prefer this to be more automated? This will help me make my answer more specific and provide some code samples and links to free tools, for example. – Dmitriy Khaykin Feb 24 '12 at 20:12
  • @DavidKhaykin These are public sites which I can't control. The database is MySQL 5. Yes, Excel is only here because I know of the built-in function. :) I hope that's enough info. The information I have now is only a few tables with no connection to each other. I'm looking into the ways I can build the database before I design it. :) – Anders Gerner Feb 26 '12 at 14:51
  • It won't allow me to add the first table. Here it is: http://soccernet.espn.go.com/stats/_/league/eng.1/barclays-premier-league?cc=5739 – Anders Gerner Feb 27 '12 at 21:08

6 Answers

7

Anders, Excel has a built-in way to fetch the data, and you only have to set it up once. The next time, you simply refresh the query. Please see this link:

html parsing of cricinfo scorecards

FOLLOWUP

Try looking at this page: soccernet.espn.go.com/stats/_/league/eng.1/… There are 3 tables, but it doesn't seem that Excel detects them. :( – Anders Gerner 7 mins ago

In this particular website, if you view the source, you will see that the tables don't have IDs. All three tables have the same class, "tablehead". If you want, in the workbook open event you can loop through all the tables and extract the data. Your job is made easier because all three tables share the same class.

Alternatively, you can also do this:

In Excel, click on File | Open and, in the dialog box, directly type the URL you mentioned below. You will notice that Excel stacks up the data neatly. :)

You can in fact write a small macro which opens a temp workbook, opens the URL, and then simply extracts the tables from the temp workbook into your workbook. My estimate is that on a good internet connection, the entire process should not take more than 15 seconds to complete.
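If you would rather drive this from your C# program than from a VBA macro, the same web query can be created through the Excel interop assemblies. A minimal sketch, assuming a reference to Microsoft.Office.Interop.Excel and using the URL from your comment (the selection and refresh settings are assumptions you may want to tweak):

using Microsoft.Office.Interop.Excel;

class WebQueryDemo
{
    static void Main()
    {
        var excel = new Application();
        Workbook wb = excel.Workbooks.Add();
        Worksheet ws = (Worksheet)wb.Worksheets[1];

        // "URL;" connections create the same web query as Data > From Web in the UI.
        QueryTable qt = ws.QueryTables.Add(
            "URL;http://soccernet.espn.go.com/stats/_/league/eng.1/barclays-premier-league?cc=5739",
            ws.Range["A1"], System.Type.Missing);
        qt.WebSelectionType = XlWebSelectionType.xlAllTables; // pull every table on the page
        qt.RefreshOnFileOpen = true;  // re-fetch the data each time the workbook is opened
        qt.Refresh(false);            // false = refresh synchronously before continuing

        wb.SaveAs("soccer.xlsx");
        excel.Quit();
    }
}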

Siddharth Rout
  • This was very helpful, but unfortunately it has come to my attention that the built-in function isn't able to fetch all kinds of tables. :( But the post you made on the other question was very, very nice! – Anders Gerner Feb 26 '12 at 14:38
  • Try looking at this page: http://soccernet.espn.go.com/stats/_/league/eng.1/barclays-premier-league?cc=5739 There are 3 tables, but it doesn't seem that Excel detects them. :( – Anders Gerner Feb 27 '12 at 20:38
  • I tried using File | Open and it seems to work great. This way it's easy for me to just get the value from a specific cell. :) I have tried a little macro code before, but not that much in Excel. I tried VBA, but I'm not a big fan of Visual Basic. I hope it's not too much to ask, but could you help with macro code that would refresh the data every time it opens? :) – Anders Gerner Feb 27 '12 at 21:20
  • Thanks mate. It's only 11 PM here. :) But get to bed; we can't have your wife divorcing you because you are being nice to me! :) Cheers. – Anders Gerner Feb 27 '12 at 22:05
1

If I'm just reading web page information, I find the HtmlAgilityPack extremely helpful. It makes it easy to use LINQ to find certain tags with identifying information and then navigate sub-tags easily. So you could find a <table> tag, easily find <tr> and <td>, and capture the InnerText property to read the contents of a cell.
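For example, a minimal sketch along those lines (the "tablehead" class comes from the comments on the other answer; error handling is omitted):

using System;
using System.Linq;
using HtmlAgilityPack;

class HapDemo
{
    static void Main()
    {
        var web = new HtmlWeb();
        HtmlDocument doc = web.Load("http://soccernet.espn.go.com/stats/_/league/eng.1/barclays-premier-league?cc=5739");

        // Find every <table> whose class is "tablehead" and dump its cells.
        var tables = doc.DocumentNode.Descendants("table")
            .Where(t => t.GetAttributeValue("class", "") == "tablehead");

        foreach (var table in tables)
        {
            foreach (var row in table.Descendants("tr"))
            {
                var cells = row.Descendants("td").Select(td => td.InnerText.Trim());
                Console.WriteLine(string.Join(" | ", cells.ToArray()));
            }
        }
    }
}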

Jacob Proffitt
  • I agree. I'm already using the HAP for the data I get from the websites right now. :) I guess I'm looking for some more samples of both solutions. – Anders Gerner Feb 14 '12 at 18:24
  • @AndersGerner - HAP is XPATH **and** XSLT compatible. It means the coupling/binding between your various sources and your unique target (your SQL database) could be located in only one XSLT file per source. For each source, one transformation could then lead to one XML stream as output that you can easily inject into any database. The beauty of this kind of solution is that you don't have to recompile anything when sources change; just update the XSLT to match the changes. The drawback is you need to understand a bit of XSLT and XPATH. – Simon Mourier Feb 25 '12 at 18:49
  • @SimonMourier I know some XPATH from using Selenium and HAP, but XSLT is new to me. Do you know any good sites to learn about it? It seems to be a nice way to solve my problem! :) – Anders Gerner Feb 26 '12 at 14:34
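A minimal sketch of the pipeline described in the comment above, assuming one hypothetical stylesheet (soccernet.xslt) per source; the HtmlAgilityPack fixes up the HTML into well-formed XML, and System.Xml.Xsl applies the transform:

using System.Xml.Xsl;
using HtmlAgilityPack;

class XsltPipeline
{
    static void Main()
    {
        // Load the source page and have HAP emit it as well-formed XML.
        var web = new HtmlWeb();
        HtmlDocument doc = web.Load("http://soccernet.espn.go.com/stats/_/league/eng.1/barclays-premier-league?cc=5739");
        doc.OptionOutputAsXml = true;
        doc.Save("source.xml");

        // One XSLT per source maps that site's table layout to your common schema;
        // when the site changes, you update the stylesheet instead of recompiling.
        var xslt = new XslCompiledTransform();
        xslt.Load("soccernet.xslt"); // hypothetical stylesheet
        xslt.Transform("source.xml", "normalized.xml");
    }
}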
1

You can use Visual Web Ripper; they have an API that you can use from .NET, and you can build templates using their designer to pull the data you want. It's very easy to use; my company has used it to pull reviews from sites, even with paging and search.

Mahmoud Darwish
  • I tried Visual Web Ripper, but found it a bit too much. The way they show it in videos makes it look very easy, but I couldn't find anything about gathering table data. Tried it, but no luck. It's either the program or me. :P – Anders Gerner Feb 26 '12 at 14:36
0

My approach would be to use a tool to generate an RSS feed for each of the URLs containing your table data, and then display the data in your UI (be it WPF, WinForms, or ASP.NET). This way you can easily set up additional 'channels' when you find a new website to pull data from; your work will be to normalize the new site into your standard RSS feed format (configurable in one of these tools). You could even configure your UI to pull the additional feed based on a config setting, so there is no need to recompile when adding a new site.

You can decide to store the feed data in a DB or just display it in real time, and implement caching/refreshing of the data at regular intervals automatically. The basic premise of the approach is to standardize the various table formats of each site into one common format (RSS or otherwise) and then only worry about consuming one standard format in your app. This could be set up in a class library that presents the data in the common format; that class library can then be consumed by both your C# app and your web application.
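Once every site is exposed as a feed, the consuming side only needs one code path. A minimal sketch using System.ServiceModel.Syndication (the feed URL is a placeholder for whatever the generating tool gives you):

using System;
using System.ServiceModel.Syndication;
using System.Xml;

class FeedConsumer
{
    static void Main()
    {
        // One standard format in: the site-specific differences were already
        // handled when the feed was generated.
        using (XmlReader reader = XmlReader.Create("http://example.com/soccer-tables.rss"))
        {
            SyndicationFeed feed = SyndicationFeed.Load(reader);
            foreach (SyndicationItem item in feed.Items)
            {
                Console.WriteLine("{0}: {1}", item.Title.Text, item.Summary.Text);
            }
        }
    }
}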

Edit: Here is a link to good info about several tools that can be used to create an RSS feed from any website: http://profy.com/2007/09/30/7-tools-to-make-an-rss-feed-of-any-website/

Dmitriy Khaykin
  • Nice approach, but do you know a tool that could make these RSS feeds? :) – Anders Gerner Feb 26 '12 at 14:40
  • I know of several; I will pick a couple and make an example. If you can provide at least two links to these sites, I can provide an example that you will be able to use to get started. – Dmitriy Khaykin Feb 26 '12 at 14:58
0

You can use Selenium (a tool for automated web testing). It is extremely useful: its API will allow you to do things like searching for a specific table by XPath, CSS, or the DOM.

You can drive Selenium through "remote control" from a lot of different languages. See: http://seleniumhq.org/projects/remote-control/

See for example for C#: http://www.theautomatedtester.co.uk/tutorials/selenium/selenium_csharp_nunit.htm

See Stack Overflow for an example: How do I retrieve the text in a table column using Selenium RC?
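A minimal sketch with the Selenium RC .NET client (it assumes an RC server running locally on the default port; the table locator is an assumption based on the "tablehead" class mentioned in the other answer's comments):

using System;
using Selenium; // Selenium RC .NET client

class SeleniumTableDemo
{
    static void Main()
    {
        ISelenium selenium = new DefaultSelenium("localhost", 4444, "*firefox",
            "http://soccernet.espn.go.com/");
        selenium.Start();
        selenium.Open("/stats/_/league/eng.1/barclays-premier-league?cc=5739");

        // GetTable takes "tableLocator.row.column" (0-based rows and columns).
        string cell = selenium.GetTable("xpath=//table[@class='tablehead'].1.0");
        Console.WriteLine(cell);
        selenium.Stop();
    }
}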

Rob Audenaerde
  • I have used Selenium before, but I don't know why I didn't think of trying this out. I'll give it a go and come back with the result. :) Thank you! – Anders Gerner Feb 26 '12 at 14:42
0

Here is some sample code using the HtmlAgilityPack:

using System;
using System.Collections.Generic;
using System.Web;
using System.Xml.XPath;

using HtmlAgilityPack;

namespace TableRipper
{
    class Program
    {
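        // Turns one row's cells into CSV-safe strings: values containing commas
        // or quotes are wrapped in quotes, with embedded quotes doubled.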
        static List<string> SerializeColumnSet(XPathNodeIterator columnSet)
        {
            List<string> serialized = new List<string>();

            while (columnSet.MoveNext())
            {
                string value = HttpUtility.HtmlDecode(columnSet.Current.Value.ToString().Trim());

                if (value.Contains(",") || value.Contains("\""))
                {
                    value = string.Concat('"', value.Replace("\"", "\"\""), '"');
                }

                serialized.Add(value);
            }

            return serialized;
        }

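        // Downloads the page, selects the first table matching the XPath expression,
        // and returns it as a list of rows, each a list of cell strings (the header
        // row comes first when includeHeaders is true and a thead or th row exists).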
        static List<List<string>> RipTable(string url, string xpath, bool includeHeaders = true)
        {
            HtmlWeb web = new HtmlWeb();
            HtmlDocument document = web.Load(url);
            XPathNavigator navigator = document.CreateNavigator();
            XPathNodeIterator tableElementSet = navigator.Select(xpath);
            List<List<string>> table = new List<List<string>>();

            if (tableElementSet.MoveNext())
            {
                XPathNavigator tableElement = tableElementSet.Current;
                XPathNavigator tableBodyElement = tableElement.SelectSingleNode("tbody") ?? tableElement;
                XPathNodeIterator tableRowSet = tableBodyElement.Select("tr");
                bool hasRows = tableRowSet.MoveNext();

                if (hasRows)
                {
                    if (includeHeaders)
                    {
                        XPathNavigator tableHeadElement = tableElement.SelectSingleNode("thead");
                        XPathNodeIterator tableHeadColumnSet = null;

                        if (tableHeadElement != null)
                        {
                            tableHeadColumnSet = tableHeadElement.Select("tr/th");
                        }
                        else if ((tableHeadColumnSet = tableRowSet.Current.Select("th")).Count > 0)
                        {
                            hasRows = tableRowSet.MoveNext();
                        }

                        if (tableHeadColumnSet != null)
                        {
                            table.Add(SerializeColumnSet(tableHeadColumnSet));
                        }
                    }

                    if (hasRows)
                    {
                        do
                        {
                            table.Add(SerializeColumnSet(tableRowSet.Current.Select("td")));
                        }
                        while (tableRowSet.MoveNext());
                    }
                }
            }

            return table;
        }

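        // Usage: TableRipper <url> <table-xpath>; prints the ripped table as CSV.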
        static void Main(string[] args)
        {
            foreach (List<string> row in RipTable(args[0], args[1]))
            {
                Console.WriteLine(string.Join(",", row));
            }
        }
    }
}

Tested against:

http://www.msn.com "//table[@summary='Market Update']"

http://www.worldclimate.com/cgi-bin/data.pl?ref=N48W121+2200+450672C "//table[1]"

It is far from perfect; for example, it won't handle colspan or rowspan, but it is a start.
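Since you mention a MySQL database, one way to persist the ripped rows is MySQL Connector/NET. A minimal sketch, assuming a hypothetical league_table (team, played, points) schema and reusing RipTable from above (made public):

using System.Collections.Generic;
using MySql.Data.MySqlClient;

class Persister
{
    static void Save(string url, string xpath)
    {
        // Connection string values are placeholders for your own server.
        using (var conn = new MySqlConnection("server=localhost;database=soccer;uid=user;pwd=pass"))
        {
            conn.Open();
            foreach (List<string> row in TableRipper.Program.RipTable(url, xpath, false))
            {
                using (var cmd = new MySqlCommand(
                    "INSERT INTO league_table (team, played, points) VALUES (@team, @played, @points)", conn))
                {
                    // Column indexes depend on the table being ripped; 0/1/2 is an assumption.
                    cmd.Parameters.AddWithValue("@team", row[0]);
                    cmd.Parameters.AddWithValue("@played", row[1]);
                    cmd.Parameters.AddWithValue("@points", row[2]);
                    cmd.ExecuteNonQuery();
                }
            }
        }
    }
}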