I'm trying to scrape pages, find their schema.org script, then deserialize it.

I am able to find the script. However, JSON that is valid according to Google/schema.org is reported as invalid by most JSON validator tools.

For example, this is my code

    using HtmlAgilityPack;
    using Newtonsoft.Json;

    string Url = "https://www.independent.co.uk/news/health/nhs-pay-health-coronavirus-unions-b1812659.html";
    HtmlWeb web = new HtmlWeb();
    HtmlDocument doc = web.Load(Url);
    // Collect every <script> element on the page
    var scripts = doc.DocumentNode.SelectNodes("//script");
    foreach (HtmlNode node in scripts)
    {
        string value = node.InnerText;
        // Keep only the scripts that mention schema.org
        if (value.Contains("schema.org"))
        {
            dynamic results = JsonConvert.DeserializeObject<dynamic>(value);
            var name = results.name;
        }
    }

This finds the following schema (JSON):

{{
  "@type": "Organization",
  "@context": "https://schema.org",
  "name": "The Independent",
  "url": "https://www.independent.co.uk",
  "logo": {
    "@type": "ImageObject",
    "url": "https://www.independent.co.uk/img/logo.png",
    "width": 504,
    "height": 60
  },
  "sameAs": [
    "https://twitter.com/Independent",
    "https://www.facebook.com/TheIndependentOnline"
  ]
}}

#1 The JSON is supposedly invalid, even though every website using structured data uses it like this

#2 When I try to get the name value, it returns null.

I assume my problems are because the JSON is invalid. How do I make this work? I'm out of ideas.

MattHodson
  • The issue here is the extra curly brackets around the object. A JSON object needs to have properties; it is not valid JSON if you have an object with just another object nested inside. `{ }` is a valid, empty JSON object. This, however, is not valid: `{{ }}`. – Lasse V. Karlsen Mar 05 '21 at 11:57

1 Answer


You need to get rid of the extra curly brackets at the start and end of the JSON to make it valid JSON.
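
One way to check this is to parse the same object with and without the doubled braces. The following is a minimal sketch, assuming Json.NET (`Newtonsoft.Json`) as used in the question; the `BraceCheck` class and its method names are hypothetical and only serve the illustration.

```csharp
using System;
using Newtonsoft.Json;
using Newtonsoft.Json.Linq;

// Hypothetical helper for illustration; not part of Json.NET.
public static class BraceCheck
{
    // Returns true when the text parses as a single JSON object.
    public static bool IsJsonObject(string text)
    {
        try
        {
            JObject.Parse(text);
            return true;
        }
        catch (JsonReaderException)
        {
            // Json.NET rejects the text, e.g. because of a second opening brace.
            return false;
        }
    }

    // Reads the top-level "name" property, or returns null when it is absent.
    public static string GetName(string text)
    {
        JObject obj = JObject.Parse(text);
        return (string)obj["name"];
    }
}
```

With this sketch, `BraceCheck.IsJsonObject("{{\"name\":\"The Independent\"}}")` returns `false`, while the same text with single braces returns `true` and `GetName` yields `"The Independent"`.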

Dean Potter
  • Can I ask, why does schema.org tell us to format like this? Also, why does Google pass it in validation tests? – MattHodson Mar 05 '21 at 11:56
  • It shouldn't be telling you to format it like that; if you inspect the source of that website, you can see the schema starts off as follows: – Dean Potter Mar 05 '21 at 12:01
  • Also, if you are referring to this Google tool, https://search.google.com/structured-data/testing-tool, it fails the checks for me. – Dean Potter Mar 05 '21 at 12:02
  • I've just realized through debugging... When I deserialize into "results", it adds the extra curlies. I thought I was initially scraping it like this. Why in the world would it do that, grr. – MattHodson Mar 05 '21 at 12:03
  • Haha, glad I could be of assistance. Good luck sorting it out! – Dean Potter Mar 05 '21 at 12:08
  • But seeing as you know the response you are expecting, could you not just build a model class with those properties (name, type, etc.) and deserialize to that rather than to dynamic? – Dean Potter Mar 05 '21 at 12:10
  • The problem is, the website will be dynamic, so those values will change. – MattHodson Mar 05 '21 at 12:28
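
The last two comments weigh a fixed model class against `dynamic`. A possible middle path is sketched below, assuming Json.NET's LINQ-to-JSON types (`JToken`, `JObject`) and assuming the scraped script may hold either a single object or an array of objects. The `JsonLdNames` class and `ReadNames` method are hypothetical names, not something from the thread.

```csharp
using System.Collections.Generic;
using System.Linq;
using Newtonsoft.Json.Linq;

// Hypothetical helper for illustration; not part of Json.NET.
public static class JsonLdNames
{
    // Collects every top-level string "name" from a JSON-LD payload,
    // whether the payload is one object or an array of objects.
    public static List<string> ReadNames(string json)
    {
        JToken root = JToken.Parse(json);
        var items = new List<JObject>();

        if (root.Type == JTokenType.Array)
        {
            // Keep only the array elements that are objects.
            items.AddRange(root.Children().OfType<JObject>());
        }
        else if (root is JObject single)
        {
            items.Add(single);
        }

        var names = new List<string>();
        foreach (JObject item in items)
        {
            JToken name = item["name"];
            // Skip objects without a "name", or with a non-string "name".
            if (name != null && name.Type == JTokenType.String)
            {
                names.Add((string)name);
            }
        }
        return names;
    }
}
```

Because nothing is bound to a fixed class, properties that differ from site to site are simply skipped when absent instead of causing a failure.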