I see every code token in a pre element is surrounded by a span with styling. How could I, e.g. extract the line:

using MySql.Data.MySqlClient;

from the HTML:

<pre lang="cs" id="pre82739" style="margin-top: 0px;"><span class="code-keyword">using</span> MySql.Data.MySqlClient;</pre>

The extracted code need not include syntax highlighting or other markup; it just has to compile and do the task it is intended to do.

The point of this is to build a means of automatically transferring code from web pages, and eventually other devices and formats, to a destination where it can be used quickly, such as sending it to Skype, where the user can just copy and paste it into their code.

ProfK

2 Answers

Well, if you want to strip the HTML and you're already trying regex (without success), here's one I threw together for my own project that you should be able to use.

/(<\/?(pre|span)[\s\S]*?>)|((style|class|id|lang) ?\= ?['"][\s\S]*?['"])/gi

That's from a JavaScript file, so the syntax wrapped around it will differ in C#, but the expression itself should carry over almost unchanged. Adjust as you see fit.
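
For the C# side, a rough sketch of how the tag-matching half of that expression might be applied with Regex.Replace is below. The class and method names (RegexStripDemo, StripHighlighting) are made up for illustration. The attribute alternative is left out on purpose: the tag alternative already consumes everything up to the closing >, and running the attribute part over the remaining text could also strip something like id = "x" out of the code itself. The entity decoding step is my addition, on the assumption that the page encodes < and > inside the code.

```csharp
using System;
using System.Net;
using System.Text.RegularExpressions;

public static class RegexStripDemo
{
    // Tag-matching half of the expression above: any opening or closing
    // <pre>/<span> tag, attributes included. IgnoreCase mirrors the "i" flag;
    // Regex.Replace already replaces every match, like the "g" flag.
    private static readonly Regex TagPattern =
        new Regex(@"</?(pre|span)[\s\S]*?>", RegexOptions.IgnoreCase);

    public static string StripHighlighting(string html)
    {
        string text = TagPattern.Replace(html, string.Empty);

        // Assumption: the source page entity-encodes the code (&lt; for <),
        // so decode once the tags are gone.
        return WebUtility.HtmlDecode(text);
    }

    public static void Main()
    {
        string html = "<pre lang=\"cs\" id=\"pre82739\" style=\"margin-top: 0px;\"><span class=\"code-keyword\">using</span> MySql.Data.MySqlClient;</pre>";

        // prints "using MySql.Data.MySqlClient;"
        Console.WriteLine(StripHighlighting(html));
    }
}
```

This only handles the two tag names in the pattern; any other inline markup inside the block would need to be added to the alternation.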

Live demo of course.

Deryck

The HTML Agility Pack makes this trivially easy. This library allows you to query and navigate HTML as you would an XML document, using XPath or LINQ.

using System;
using HtmlAgilityPack;

namespace ConsoleApplication1
{
    class Program
    {
        private static void Main(string[] args)
        {
            string html = "<pre lang=\"cs\" id=\"pre82739\" style=\"margin-top: 0px;\"><span class=\"code-keyword\">using</span> MySql.Data.MySqlClient;</pre>";

            var document = new HtmlDocument();
            document.LoadHtml(html);

            //"//pre" is the XPATH to find all tags in the document that are named `<pre>`
            foreach (var node in document.DocumentNode.SelectNodes("//pre"))
            {
                //prints "using MySql.Data.MySqlClient;"
                Console.WriteLine(node.InnerText);
                Console.WriteLine("--------------------------");
            }

            Console.ReadLine();
        }
    }
}
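
Since LINQ was mentioned as the alternative to XPath, here is a minimal sketch of the same query written that way. The class and method names (LinqQueryDemo, ExtractCSharpBlocks) are made up for illustration, and filtering on lang="cs" is an assumption based on the attribute in your sample; drop the Where clause if your pages do not mark the language.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

public static class LinqQueryDemo
{
    // Hypothetical helper: returns the text of every <pre lang="cs"> block.
    public static List<string> ExtractCSharpBlocks(string html)
    {
        var document = new HtmlDocument();
        document.LoadHtml(html);

        // Descendants("pre") yields an empty sequence when nothing matches,
        // so no null check is needed here.
        return document.DocumentNode
            .Descendants("pre")
            .Where(pre => pre.GetAttributeValue("lang", "") == "cs")
            .Select(pre => pre.InnerText)
            .ToList();
    }

    public static void Main()
    {
        string html = "<pre lang=\"cs\" id=\"pre82739\" style=\"margin-top: 0px;\"><span class=\"code-keyword\">using</span> MySql.Data.MySqlClient;</pre>";

        foreach (string block in ExtractCSharpBlocks(html))
        {
            Console.WriteLine(block);
        }
    }
}
```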

If you passed in a full HTML document, it would call Console.WriteLine(node.InnerText); once per <pre> block. For example, here it is parsing your own question (you should get 5 results as of this writing; that number may change if other users add more <pre> blocks).

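// Requires "using System.Net;" for WebClient, in addition to the usings above.
private static void Main(string[] args)
{
    var document = new HtmlDocument();
    using (var client = new WebClient())
    {
        var page = client.DownloadString("http://stackoverflow.com/questions/31894197");
        document.LoadHtml(page);
    }

    // SelectNodes returns null (not an empty collection) when nothing matches
    var nodes = document.DocumentNode.SelectNodes("//pre");
    if (nodes == null)
    {
        Console.WriteLine("No <pre> blocks found.");
        return;
    }

    foreach (var node in nodes)
    {
        Console.WriteLine(node.InnerText);
        Console.WriteLine("--------------------------");
    }

    Console.ReadLine();
}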
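
One thing to watch if the extracted text has to compile: as far as I know, InnerText hands back the text as it appears in the page source, so code containing < or > may come out as &lt; and &gt;. A small sketch of decoding it with HtmlEntity.DeEntitize follows (WebUtility.HtmlDecode from System.Net should work as well). The class name, method name and sample snippet are made up for illustration.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack;

public static class DecodeDemo
{
    // Hypothetical helper: text of every <pre> block with entities decoded.
    public static List<string> ExtractCode(string html)
    {
        var document = new HtmlDocument();
        document.LoadHtml(html);

        var results = new List<string>();

        // SelectNodes returns null when no <pre> element is present
        var nodes = document.DocumentNode.SelectNodes("//pre");
        if (nodes == null)
        {
            return results;
        }

        foreach (var node in nodes)
        {
            // Turns &lt; and &gt; back into < and >
            results.Add(HtmlEntity.DeEntitize(node.InnerText));
        }

        return results;
    }

    public static void Main()
    {
        // Made-up sample: a generic type written with entities, as it would be in page source
        string html = "<pre>List&lt;string&gt; names = new List&lt;string&gt;();</pre>";

        foreach (string code in ExtractCode(html))
        {
            Console.WriteLine(code);
        }
    }
}
```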
Scott Chamberlain