How can I extract original code from a `pre` tag?

Question

I see that code tokens in a `pre` element are wrapped in `span` elements with styling. How could I extract, for example, the line:

using MySql.Data.MySqlClient;

from the HTML:

<pre lang="cs" id="pre82739" style="margin-top: 0px;"><span class="code-keyword">using</span> MySql.Data.MySqlClient;</pre>

The extracted code need not include syntax highlighting or other markup; it just has to compile and do the task it is intended to do.

The point of this is to build a means of automatically transferring code from web pages, and eventually other devices and formats, to a destination where it can be quickly used, such as sending it to Skype, where the user can just copy and paste it into their code.

if you're trying to parse out the text from all the markup there are a bunch of libraries for each language, google should help you with that ;). — dbarnes, Aug 08 '15 at 14:42
I am currently trying using regex and other ugly HTML removal code, but wanted to ask early before I am too far down the wrong road. — ProfK, Aug 08 '15 at 14:44
@dbarnes I'm currently trying to find a google term that doesn't bloody tell me how to parse HTML with C#, my default and first effort language. — ProfK, Aug 08 '15 at 14:46

score 2 · Answer 1 · answered Aug 08 '15 at 15:02

Well, if you want to strip the HTML and you've already been trying regex (without success), here's one I threw together for my own project that you should be able to use.

/(<\/?(pre|span)[\s\S]*?>)|((style|class|id|lang) ?\= ?['"][\s\S]*?['"])/gi

That's from a JavaScript file, so the wrapping (the `/.../gi` delimiters and flags) will be different in C#, but the expression itself should carry over more or less unchanged. Adjust as you see fit.

Live demo of course.
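As a rough sketch of how the tag-matching half of this expression could be applied from JavaScript: `extractCode` and `ENTITIES` are names made up for this illustration, not part of the answer. It deliberately uses only the first alternative of the expression, because the attribute alternative can also match assignments such as `id = "x"` inside the code being extracted. It also decodes a handful of common entities, which tag stripping alone leaves encoded.

```javascript
// Tag-stripping part of the expression from this answer: matches opening and
// closing <pre> / <span> tags together with their attributes.
const TAG_PATTERN = /<\/?(pre|span)[\s\S]*?>/gi;

// Hypothetical lookup table: only a few common entities, not the full HTML set.
const ENTITIES = { lt: "<", gt: ">", amp: "&", quot: '"', "#39": "'" };

// Hypothetical helper: remove the tags first, then decode the entities listed
// above in a single pass, so that "&amp;lt;" becomes "&lt;" rather than "<".
function extractCode(html) {
  return html
    .replace(TAG_PATTERN, "")
    .replace(/&(lt|gt|amp|quot|#39);/g, (match, name) => ENTITIES[name]);
}

const html =
  '<pre lang="cs" id="pre82739" style="margin-top: 0px;">' +
  '<span class="code-keyword">using</span> MySql.Data.MySqlClient;</pre>';

console.log(extractCode(html)); // prints "using MySql.Data.MySqlClient;"
```

Stripping tags before decoding entities matters here: code that legitimately contains text such as `&lt;span&gt;` is decoded only after the real tags are gone, so it is not removed by mistake. For anything beyond simple snippets, a real HTML parser is likely the more robust choice.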

Scott Chamberlain · Answer 2 · 2015-08-08T16:26:00.747

The HTML Agility Pack makes this trivially easy. This library lets you query and navigate HTML like you would an XML document, using XPath or LINQ.

using System;
using HtmlAgilityPack;

namespace ConsoleApplication1
{
    class Program
    {
        private static void Main(string[] args)
        {
            string html = "<pre lang=\"cs\" id=\"pre82739\" style=\"margin-top: 0px;\"><span class=\"code-keyword\">using</span> MySql.Data.MySqlClient;</pre>";

            var document = new HtmlDocument();
            document.LoadHtml(html);

            //"//pre" is the XPATH to find all tags in the document that are named `<pre>`
            foreach (var node in document.DocumentNode.SelectNodes("//pre"))
            {
                //prints "using MySql.Data.MySqlClient;"
                Console.WriteLine(node.InnerText);
                Console.WriteLine("--------------------------");
            }

            Console.ReadLine();
        }
    }
}

If you pass in a full HTML document, it calls Console.WriteLine(node.InnerText); once per <pre> block. For example, here it is parsing your own question (you should get 5 results as of this writing; that number may change if other users add more <pre> blocks):

// Requires "using System.Net;" at the top of the file for WebClient.
private static void Main(string[] args)
{
    var document = new HtmlDocument();
    using (var client = new WebClient())
    {
        var page = client.DownloadString("http://stackoverflow.com/questions/31894197");
        document.LoadHtml(page);
    }

    // SelectNodes returns null when nothing matches, so guard before iterating.
    var nodes = document.DocumentNode.SelectNodes("//pre");
    if (nodes != null)
    {
        foreach (var node in nodes)
        {
            Console.WriteLine(node.InnerText);
            Console.WriteLine("--------------------------");
        }
    }

    Console.ReadLine();
}
