I'm trying to scrape pages, find their schema.org script, then deserialize it.
I can find the script. However, JSON that is valid structured data according to Google/schema.org is reported as invalid by most JSON validator tools.
For example, this is my code:
using HtmlAgilityPack;
using Newtonsoft.Json;

string Url = "https://www.independent.co.uk/news/health/nhs-pay-health-coronavirus-unions-b1812659.html";
HtmlWeb web = new HtmlWeb();
HtmlDocument doc = web.Load(Url);

// Collect every <script> element on the page.
var scripts = doc.DocumentNode.SelectNodes("//script");
foreach (HtmlNode node in scripts)
{
    string value = node.InnerText;
    // Keep only the scripts whose text mentions schema.org.
    if (value.Contains("schema.org"))
    {
        dynamic results = JsonConvert.DeserializeObject<dynamic>(value);
        var name = results.name; // this comes back as null
    }
}
This finds the following schema (JSON):
{{
  "@type": "Organization",
  "@context": "https://schema.org",
  "name": "The Independent",
  "url": "https://www.independent.co.uk",
  "logo": {
    "@type": "ImageObject",
    "url": "https://www.independent.co.uk/img/logo.png",
    "width": 504,
    "height": 60
  },
  "sameAs": [
    "https://twitter.com/Independent",
    "https://www.facebook.com/TheIndependentOnline"
  ]
}}
#1 Validators report the JSON as invalid, even though websites that use structured data appear to publish it in exactly this form.
#2 When I try to read the name value, it returns null.
I assume both problems occur because the JSON is invalid. How do I make this work? I'm out of ideas.