I have programmatically downloaded the contents of a web page and hold it in a string variable. What is the best way to find the content URL of the "og:image" meta tag?

For example, assume a snippet of the page source looks like this:

<meta property="og:site_name" content="The Christian Science Monitor"  />
<meta property="og:type" content="article"  />
<meta property="og:url" content="http://www.csmonitor.com/Business/2013/0729/Cannes-jewel-heist-53-million-in-diamonds-jewels-stolen-from-hotel"  />
<meta property="og:description" content="Cannes jewel heist saw $53 million in diamonds and other precious gems stolen from a hotel on the French Riviera. The Cannes jewel heist is the latest in a series of several brazen jewelry thefts in Europe in recent years."  />
<meta property="og:image" content="http://www.csmonitor.com/var/ezflow_site/storage/images/media/content/2013/0729-jewels/16474969-1-eng-US/0729-jewels.jpg"  />
<meta property="og:title" content="Cannes jewel heist: $53 million in diamonds, jewels stolen from hotel"  />
<meta name="sailthru.author" content="Thomas Adamson"  />

I would like to extract the string "http://www.csmonitor.com/var/ezflow_site/storage/images/media/content/2013/0729-jewels/16474969-1-eng-US/0729-jewels.jpg", which is the content of the "og:image" tag.

I could write some code that searches for substrings and take it from there, but I would like to accomplish this with a regular expression similar to this:

List<Uri> links = new List<Uri>();
string regexImgSrc = @"<img[^>]*?src\s*=\s*[""']?([^'"" >]+?)[ '""][^>]*?>";

MatchCollection matchesImgSrc = Regex.Matches(htmlSource, regexImgSrc, RegexOptions.IgnoreCase | RegexOptions.Singleline);

This last example scans a web page source and extracts all the image tags. I would like to do the same with og:image tags, but I am not well versed in regular expressions.

durron597
Archil Kublashvili
  • **Don't use regular expressions to parse HTML. Use a proper HTML parsing module.** You cannot reliably parse HTML with regular expressions, and you will face sorrow and frustration down the road. As soon as the HTML changes from your expectations, your code will be broken. – Andy Lester Jul 29 '13 at 14:06
  • Thanks Andy. Can you suggest a proper HTML parsing module? I am in C# environment. – Archil Kublashvili Jul 29 '13 at 14:10
  • I'm sorry, I can't. I don't know the C# world at all. Search for "html parsing C#" in StackOverflow, or in Google, and see what you can find. – Andy Lester Jul 29 '13 at 14:15

1 Answer

I don't think you should use a regex; it can get messy depending on how the tag is written in the HTML. For example, content= might come before property=. I did it with plain string handling because I didn't want to use an HTML or XML parser library. Here's what I ended up doing.

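// Requires: using System; using System.Collections.Generic;
// strIn holds the downloaded HTML source.
Dictionary<string, string> metatags = new Dictionary<string, string>();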
int TagStart,TagEnd;
string element;
int AttrStart, AttrEnd;
string PropVal,ContentVal;
TagStart = strIn.IndexOf("<meta", StringComparison.OrdinalIgnoreCase);
while(TagStart != -1) {
    TagEnd = strIn.IndexOf(">", TagStart + 1, StringComparison.OrdinalIgnoreCase);
    if (TagEnd != -1) {
        element = strIn.Substring(TagStart, TagEnd - TagStart + 1);
        //Console.WriteLine("\nPROCESSING META TAG: {0}",element);
        PropVal = null;
        ContentVal = null;

        // Get "property" attribute
        AttrStart = element.IndexOf("property=\"", StringComparison.OrdinalIgnoreCase);
        if (AttrStart != -1) {
            AttrStart = AttrStart + 10;
            AttrEnd = element.IndexOf("\"", AttrStart, StringComparison.OrdinalIgnoreCase);
            if(AttrEnd != -1) {
                PropVal = element.Substring(AttrStart, AttrEnd - AttrStart);
            }
        }
        // Get "content" attribute
        AttrStart = element.IndexOf("content=\"", StringComparison.OrdinalIgnoreCase);
        if(AttrStart != -1) {
            AttrStart = AttrStart + 9;
            AttrEnd = element.IndexOf("\"", AttrStart, StringComparison.OrdinalIgnoreCase);
            if(AttrEnd != -1) {
                ContentVal = element.Substring(AttrStart, AttrEnd - AttrStart);
            }
        }
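        // Keep the first value seen; Add() would throw on a repeated property
        if (PropVal != null && ContentVal != null && !metatags.ContainsKey(PropVal))
            metatags.Add(PropVal, ContentVal);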

    }
    // go to next meta tag
    TagStart = strIn.IndexOf("<meta", TagStart + 1, StringComparison.OrdinalIgnoreCase);
}
Console.WriteLine("\nOG meta tags");
foreach(var item in metatags) {
    Console.WriteLine("KEY={0} VALUE={1}",item.Key,item.Value);
}
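
For comparison, here is a rough sketch of what a regex version would need in order to cope with the attribute-order problem mentioned above. It matches the whole meta tag first and then reads each attribute separately, so property= and content= can appear in either order. The class and method names (OgImageSketch, GetAttr, FindOgImage) are made up for illustration, and the sketch assumes quoted attribute values and no ">" inside a value, so it is still fragile in the same ways as the string-handling code.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper for illustration; not part of the code above.
static class OgImageSketch
{
    // Matches one whole <meta ...> tag (assumes no '>' inside attribute values).
    static readonly Regex MetaTag = new Regex(@"<meta\b[^>]*>", RegexOptions.IgnoreCase);

    // Reads a single attribute from a tag, whatever its position in the tag.
    // Assumption: the value is wrapped in double or single quotes.
    static string GetAttr(string tag, string name)
    {
        Match m = Regex.Match(
            tag,
            @"(?<![\w-])" + name + @"\s*=\s*(?:""([^""]*)""|'([^']*)')",
            RegexOptions.IgnoreCase);
        if (!m.Success) return null;
        return m.Groups[1].Success ? m.Groups[1].Value : m.Groups[2].Value;
    }

    // Returns the content of the first og:image meta tag, or null if none is found.
    public static string FindOgImage(string html)
    {
        foreach (Match tag in MetaTag.Matches(html))
        {
            string prop = GetAttr(tag.Value, "property");
            if (string.Equals(prop, "og:image", StringComparison.OrdinalIgnoreCase))
                return GetAttr(tag.Value, "content");
        }
        return null;
    }
}
```

Calling OgImageSketch.FindOgImage(strIn) on the page source would then give the image URL directly, without building the whole dictionary.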