
When building a knowledge graph, the first step (if I understand it correctly) is to collect structured data, mainly RDF triples written using some ontology, for example Schema.org. Now, what is the best way to collect these RDF triples?

It seems there are two things we can do.

  1. Use a crawler to crawl web content and, for a specific page, search for RDF triples on that page. If we find them, collect them. If not, move on to the next page.

  2. For the current page, instead of looking for existing RDF triples, use some NLP tools to understand the page content (for example NELL, see http://rtw.ml.cmu.edu/rtw/).

Now, is my understanding above (basically/almost) correct? If so, why do we use NLP? Why not just rely on the existing RDF triples? It seems like NLP is not as good/reliable as we are hoping… I could be completely wrong.

Here is another attempt at asking the same question.

Let us say we want to create RDF triples using the third method mentioned by @AKSW, i.e., extracting RDF triples from some web pages (text).

For example, take this page. If you open it and use "view source", you can see quite a few semantic markups there (using OGP and Schema.org). So my crawler can simply do this: ONLY crawl/parse these markups, easily convert them into RDF triples, declare success, and move on to the next page.

So what the crawler does on such a text page is very simple: it only collects the semantic markup and creates RDF triples from it. It is simple and efficient.
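To make this markup-only approach concrete, here is a minimal sketch of such a crawler step in Python. It is only an illustration: it assumes the requests and rdflib (version 6 or later, which bundles a JSON-LD parser) packages, it only handles JSON-LD script blocks (RDFa and Microdata would need an extra parser such as extruct), and the URL at the bottom is a placeholder.

```python
# Minimal sketch: harvest embedded JSON-LD markup and turn it into RDF triples.
# Assumes: requests and rdflib >= 6 are installed; the URL is a placeholder.
from html.parser import HTMLParser

import requests
from rdflib import Graph


class JSONLDCollector(HTMLParser):
    """Collects the contents of <script type="application/ld+json"> blocks."""

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self.blocks = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and ("type", "application/ld+json") in attrs:
            self._in_jsonld = True
            self.blocks.append("")

    def handle_endtag(self, tag):
        if tag == "script":
            self._in_jsonld = False

    def handle_data(self, data):
        if self._in_jsonld:
            self.blocks[-1] += data


def harvest(url):
    html = requests.get(url, timeout=30).text
    collector = JSONLDCollector()
    collector.feed(html)

    graph = Graph()
    for block in collector.blocks:
        try:
            graph.parse(data=block, format="json-ld")
        except Exception:
            continue  # malformed markup: skip it and move on, as described above
    return graph


if __name__ == "__main__":
    g = harvest("https://example.org/some-page")  # placeholder URL
    for s, p, o in g:
        print(s, p, o)
```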

The other choice is to use NLP tools to automatically extract structured semantic data from this same text (maybe because we are not satisfied with the existing markup). Once we extract the structured information, we then create RDF triples from it. This is obviously a much harder thing to do, and we are not sure about its accuracy either.
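For contrast, here is a rough sketch of what the NLP route could look like: run named-entity recognition over the plain text and mint RDF triples from the entities. It assumes spaCy (with the en_core_web_sm model) and rdflib; the ex: namespace and the naive entity-to-URI mapping are invented for illustration, and a real pipeline would still need entity linking and relation extraction on top, which is where the accuracy concerns come in.

```python
# Rough sketch of the NLP route: NER over plain text, then naive triple minting.
# Assumes: spaCy with the en_core_web_sm model, and rdflib, are installed.
# The "ex:" namespace and the entity-to-URI mapping are made up for illustration;
# a real pipeline would add entity linking and relation extraction.
import spacy
from rdflib import Graph, Literal, Namespace
from rdflib.namespace import RDF, RDFS

EX = Namespace("http://example.org/resource/")  # hypothetical namespace


def text_to_triples(text):
    nlp = spacy.load("en_core_web_sm")
    doc = nlp(text)

    graph = Graph()
    graph.bind("ex", EX)
    for ent in doc.ents:
        # Very naive URI minting: just slugify the surface form.
        uri = EX[ent.text.replace(" ", "_")]
        graph.add((uri, RDF.type, EX[ent.label_]))   # e.g. ex:PERSON, ex:ORG
        graph.add((uri, RDFS.label, Literal(ent.text)))
    return graph


if __name__ == "__main__":
    sample = "Tim Berners-Lee invented the World Wide Web at CERN in 1989."
    print(text_to_triples(sample).serialize(format="turtle"))
```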

What is the best practice here, and what are the pros and cons? I would prefer the easy/simple way: simply collect the existing markup and convert it into RDF content, instead of using NLP tools.

I am not sure how many people would agree with this. Is this the best practice? Or is it simply a question of how far our requirements lead us?

  • Your question is strange and I don't know what exactly you want to hear... Obviously, either i) you create the RDF triples manually, ii) you reuse existing RDF data, or iii) you extract RDF data from some other kind of source, e.g. relational databases, XML files, text... "Why do we use NLP?" How else do you want to automatically extract structured semantic data from text, which in fact is simply unstructured data? Of course, you don't have to - you can crawl the web and generate RDF triples manually for each web page - might be time-consuming, right? – UninformedUser Mar 02 '18 at 04:27
  • My bad, I did not ask the question clearly; let me try again, and please see the question above. – lee Mar 02 '18 at 05:17
  • There is no unique answer. 1) You don't even know how the existing semantic markup was created (it could also have been generated automatically), so there is no guarantee of correctness or high quality. 2) Doing your own data extraction from text can increase the number of RDF facts that you can find - clearly, this task is ongoing research and there are limitations in what can be achieved. Even if the NLP tools had 100% accuracy, which is not the case, the mapping from those extracted structures to RDF triples is non-trivial. – UninformedUser Mar 02 '18 at 07:03
  • That means that although you could generate more data, the quality might be poor and it would still need some manual curation by a user. – UninformedUser Mar 02 '18 at 07:04
  • Just FYI (probably you know this): embedded RDF is usually in RDFa or JSON-LD format. As for NLP, you could possibly try DBpedia Spotlight. And possibly [Ontotext NOW](http://now.ontotext.com/#channel?uri=http%3A%2F%2Fwww.ontotext.com%2Fpublishing%23International&type=channel) will be interesting for you. – Stanislav Kralin Mar 02 '18 at 08:06
  • I appreciate all the answers. So it sounds like NLP will be needed here. I could not help but wonder how Google created the Google Knowledge Graph - did they use NLP, or was it simply based on Freebase? Or some kind of mixture? If they used NLP, how did they create the RDF triples after that? I cannot imagine the workload... A follow-up question: what would be some good NLP tools one can use? I will definitely check out DBpedia Spotlight and Ontotext NOW... any "entry-level" ones? Again, I appreciate all your help! Many thanks go to @unor who helped to edit my original question to make it look better... – lee Mar 03 '18 at 00:57
  • Please take a look at https://github.com/ldspider/ldspider and my wrapper for it, https://github.com/berezovskyi/ldspider-runner, if you are into crawling. For a Java-based RDF ORM, look at https://github.com/eclipse/lyo-store, https://bitbucket.org/openrdf/alibaba, https://github.com/stardog-union/pinto, https://github.com/mhgrove/Empire, or https://github.com/cyberborean/rdfbeans. – berezovskyi Mar 03 '18 at 21:25
  • LDSpider has "Any23 handlers for other RDF serialisations, e.g. RDFa", and the ORMs are for you if you are planning to extract RDF manually (in that case, also be sure to check out http://rml.io/). Finally, a shameless plug: we are working on a concept that is the total opposite of knowledge extraction - the Eclipse Lyo project aims to help you expose all of your existing tools as linked data microservices, not only to extract knowledge, but to bidirectionally integrate the RDF and non-RDF worlds. – berezovskyi Mar 03 '18 at 21:31
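For reference, here is a rough, hedged sketch of calling DBpedia Spotlight (mentioned in the comments above) from Python as an "entry-level" NLP step. The public endpoint URL and the response field names are assumptions about the service and may change, so treat it as a starting point only.

```python
# Sketch: annotate free text with DBpedia Spotlight's public REST endpoint.
# The endpoint URL and JSON layout are assumptions about the public service
# and may change; adjust for a self-hosted Spotlight instance if needed.
import requests

SPOTLIGHT_URL = "https://api.dbpedia-spotlight.org/en/annotate"  # assumed public endpoint


def annotate(text, confidence=0.5):
    response = requests.get(
        SPOTLIGHT_URL,
        params={"text": text, "confidence": confidence},
        headers={"Accept": "application/json"},
        timeout=30,
    )
    response.raise_for_status()
    # "Resources" holds the recognised entities with their DBpedia URIs.
    return response.json().get("Resources", [])


if __name__ == "__main__":
    for resource in annotate("Berlin is the capital of Germany."):
        print(resource["@surfaceForm"], "->", resource["@URI"])
```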

1 Answer


Your question is unclear, because you did not state your data source, and all the answers on this page assumed it to be web markup. This is not necessarily the case: if you are interested in structured data published according to best practices (called Linked Data), you can use so-called SPARQL endpoints to query Linked Open Data (LOD) datasets and generate your knowledge graph via federated queries.

If you want to collect structured data from website markup, you have to parse the markup to find and retrieve lightweight annotations written in RDFa, HTML5 Microdata, or JSON-LD. The availability of such annotations may be limited on a large share of websites, but for structured data expressed in RDF you should not use NLP at all, because RDF statements are machine-interpretable and far easier to process than unstructured data such as textual website content.

The best way to create the triples you referred to depends on what you are trying to achieve.
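To illustrate the SPARQL-endpoint route, here is a minimal sketch using the SPARQLWrapper package against the public DBpedia endpoint; both the endpoint and the query are placeholders for whatever LOD datasets your knowledge graph actually needs.

```python
# Minimal sketch of querying a Linked Open Data SPARQL endpoint.
# Assumes the SPARQLWrapper package; the endpoint and query are placeholders.
from SPARQLWrapper import SPARQLWrapper, JSON

endpoint = SPARQLWrapper("https://dbpedia.org/sparql")  # public DBpedia endpoint
endpoint.setQuery("""
    PREFIX dbo: <http://dbpedia.org/ontology/>
    PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
    SELECT ?city ?label WHERE {
        ?city a dbo:City ;
              rdfs:label ?label .
        FILTER (lang(?label) = "en")
    } LIMIT 10
""")
endpoint.setReturnFormat(JSON)

results = endpoint.query().convert()
for binding in results["results"]["bindings"]:
    print(binding["city"]["value"], binding["label"]["value"])
```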

Leslie Sikos