When building a knowledge graph, the first step (if I understand it correctly) is to collect structured data, mainly RDF triples written using some ontology, for example Schema.org. Now, what is the best way to collect these RDF triples?
It seems there are two things we can do:
1. Use a crawler to crawl web content and, for each page, search for RDF triples on that page. If we find them, collect them. If not, move on to the next page.
2. For the current page, instead of looking for existing RDF triples, use some NLP tools to understand the page content (such as NELL, see http://rtw.ml.cmu.edu/rtw/).
Now, is my understanding above (basically/almost) correct? If so, why do we use NLP? Why not just rely on the existing RDF triples? It seems like NLP is not as good/reliable as we would hope… I could be completely wrong.
Here is another attempt at asking the same question.
Let us say we want to create RDF triples by using the 3rd method mentioned by @AKSW, i.e., extracting RDF triples from some web pages (text).
For example, this page. If you open it and use "view source", you can see quite a lot of semantic markup there (using OGP and Schema.org). So my crawler can simply do this: crawl/parse ONLY this markup, convert it into RDF triples, declare success, and move on to the next page.
So what the crawler has done on this text page is very simple: it only collects the semantic markup and creates RDF triples from it. It is simple and efficient.
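To make the "collect existing markup" option concrete, here is a minimal sketch of what I have in mind, using only the Python standard library. It assumes the page carries Open Graph `<meta property="og:...">` tags and Schema.org data as JSON-LD in a `<script type="application/ld+json">` element; Microdata and RDFa are not handled. The names `MarkupCollector` and `page_to_triples`, the sample page, and the choice of `https://schema.org/` as the vocabulary prefix are my own assumptions for illustration, and only flat string-valued properties are converted.

```python
import json
from html.parser import HTMLParser

OG_NS = "http://ogp.me/ns#"
SCHEMA_NS = "https://schema.org/"  # assumption: the page's @context is Schema.org
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"


class MarkupCollector(HTMLParser):
    """Collects JSON-LD script bodies and Open Graph meta tags from one page."""

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self._buffer = []
        self.jsonld_blocks = []  # raw JSON-LD strings
        self.og_properties = []  # (property, content) pairs

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "script" and attrs.get("type") == "application/ld+json":
            self._in_jsonld = True
            self._buffer = []
        elif tag == "meta" and (attrs.get("property") or "").startswith("og:"):
            self.og_properties.append((attrs["property"], attrs.get("content") or ""))

    def handle_data(self, data):
        if self._in_jsonld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self.jsonld_blocks.append("".join(self._buffer))
            self._in_jsonld = False


def page_to_triples(page_url, html):
    """Return (subject, predicate, object) tuples found in the page markup."""
    collector = MarkupCollector()
    collector.feed(html)
    collector.close()

    triples = []
    for prop, content in collector.og_properties:
        triples.append((page_url, OG_NS + prop[len("og:"):], content))

    for raw in collector.jsonld_blocks:
        try:
            doc = json.loads(raw)
        except json.JSONDecodeError:
            continue  # malformed markup: skip this block, keep the rest
        for item in doc if isinstance(doc, list) else [doc]:
            if not isinstance(item, dict):
                continue
            subject = item.get("@id", page_url)
            for key, value in item.items():
                if key in ("@context", "@id") or not isinstance(value, str):
                    continue  # nested objects and lists are ignored in this sketch
                if key == "@type":
                    triples.append((subject, RDF_TYPE, SCHEMA_NS + value))
                else:
                    triples.append((subject, SCHEMA_NS + key, value))
    return triples


# Hypothetical sample page, used only to show the call.
SAMPLE_HTML = """
<html><head>
<meta property="og:title" content="Example" />
<script type="application/ld+json">
{"@context": "https://schema.org", "@type": "Article", "headline": "Hello"}
</script>
</head><body></body></html>
"""

if __name__ == "__main__":
    for triple in page_to_triples("http://example.org/page", SAMPLE_HTML):
        print(triple)
```

A page without any such markup simply yields an empty list, which is the "move on to the next page" case.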
The other choice is to use NLP tools to automatically extract structured semantic data from this same text (maybe because we are not satisfied with the existing markup). Once we have extracted the structured information, we then create RDF triples from it. This is obviously a much harder thing to do, and we are not sure about its accuracy either (?).
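To illustrate my worry about accuracy, here is a deliberately naive sketch of extracting triples from plain text. It is not how NELL or any real NLP system works; it is a single hand-written "X is a Y" pattern, and the function name `extract_is_a` and the `is_a` predicate are made up for this example.

```python
import re

# Toy pattern: a capitalized word, then "is a"/"is an", then a lowercase word.
IS_A_PATTERN = re.compile(r"\b([A-Z][a-zA-Z]+) is an? ([a-z]+)\b")


def extract_is_a(text):
    """Return (subject, "is_a", object) tuples for every pattern match."""
    return [(m.group(1), "is_a", m.group(2)) for m in IS_A_PATTERN.finditer(text)]


if __name__ == "__main__":
    print(extract_is_a("Paris is a city. Time is a thief."))
```

On the sentence "Paris is a city. Time is a thief." the pattern matches both sentences, so the metaphor produces a triple just as confidently as the fact does. Real tools are far more sophisticated, but they still have to deal with this kind of ambiguity, which existing markup avoids.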
What is the best practice here, and what are the pros/cons? I would prefer the easy/simple way: simply collect the existing markup and convert it into RDF, instead of using NLP tools.
I am not sure how many people would agree with this. Is this the best practice? Or is it simply a question of how far our requirements lead us?