
I have been exploring the RDF triple store and semantic search capabilities of MarkLogic 7, querying with SPARQL. I was able to perform some basic operations, such as:

xquery version "1.0-ml";
import module namespace sem = "http://marklogic.com/semantics"
  at "/MarkLogic/semantics.xqy";

sem:rdf-insert(
  sem:triple(
    sem:iri("http://example.org/ns/people#m"),
    sem:iri("http://example.com/ns/person#firstName"),
    "Sam"),
  (), (), "my collection")

which creates a triple, and then query it using the following SPARQL:

PREFIX ab: <http://example.org/ns/people#>
PREFIX ac: <http://example.com/ns/person#>
SELECT ?Name
WHERE
{ ab:m ac:firstName ?Name . }
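
This query can also be run from XQuery via `sem:sparql` (a sketch; the prefixes and IRIs match the insert above):

```xquery
xquery version "1.0-ml";
import module namespace sem = "http://marklogic.com/semantics"
  at "/MarkLogic/semantics.xqy";

(: Run the SELECT; the result is a sequence of solution maps :)
sem:sparql('
  PREFIX ab: <http://example.org/ns/people#>
  PREFIX ac: <http://example.com/ns/person#>
  SELECT ?Name
  WHERE { ab:m ac:firstName ?Name . }
')
```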

which retrieves "Sam" as the result.

**Edited:** In my use case, I have a delimited (structured) file with 1 billion records that I ingested into MarkLogic using MLCP. Each record is stored in ML as, for instance:

<root>
  <ID>1000-000-000--000</ID>
  <ACCOUNT_NUM>9999</ACCOUNT_NUM>
  <NAME>Vronik</NAME>
  <ADD1>D7-701</ADD1>
  <ADD2>B-Valentine</ADD2>
  <ADD3>Street 4</ADD3>
  <ADD4>Fifth Avenue</ADD4>
  <CITY>New York</CITY>
  <STATE>NY</STATE>
  <HOMPHONE>0002600000</HOMPHONE>
  <BASEPHONE>12345</BASEPHONE>
  <CELLPHONE>54321</CELLPHONE>
  <EMAIL_ADDR>abc@gmail.com</EMAIL_ADDR>
  <CURRENT_BALANCE>10000</CURRENT_BALANCE>
  <OWNERSHIP>JOINT</OWNERSHIP>
</root>

Now I want to use the RDF/semantic features for my dataset above. However, I cannot work out whether I need to convert the document above to RDF as shown below (illustrated for the <NAME> element), assuming this is the right approach:

  <sem:triple>
    <sem:subject>unique/uri/Person</sem:subject>
    <sem:predicate>unique/uri/Name</sem:predicate>
    <sem:object datatype="http://www.w3.org/2001/XMLSchema#string"
        xml:lang="en">Vronik</sem:object>
  </sem:triple>

and then ingest these documents into ML and search using SPARQL? Or do I need to ingest my documents as they are, separately ingest triples obtained from external sources, and somehow (how?) link them to my documents before querying with SPARQL? Or is there some other way I ought to do this?

Shrey Shivam
  • I'd expect the XML based on the document to be something more like: ` :id "1000-000-000--000" ; :accountNum "9999"^^xsd:int ; :name "Vronik" ; :add1 "D7-701" ; ... ; :ownership :JOINT .` – Joshua Taylor Nov 19 '13 at 15:01
  • Is that meant to be XML, Joshua? It looks more like N3. Shrey posted his example in the `sem:triple` schema, which is how MarkLogic stores triples. It can read RDF-XML, NTriple, N3, etc. via http://docs.marklogic.com/sem:rdf-parse - but it isn't clear that Shrey needs that. – mblakele Nov 19 '13 at 15:15
  • @mblakele @Joshua Taylor: updated my question. Basically I have a **delimited file**, which I ingest via MLCP. `sem:triple` is my understanding; is this the right format my original doc should be _converted to_ before ingesting? I would like to perform a bulk load/transform, as I have around a billion records – Shrey Shivam Nov 20 '13 at 06:54

2 Answers


It's up to you. If you want to use XML for some facts and triples for others, you can transform selected facts from XML to triples, and combine those in the same documents. For the XML you presented, that's how I'd start. As you insert or update each document in the original XML format, pass it through XQuery that adds new triples. I'd keep those new triples in the same document with the original XML.
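
A minimal sketch of such a transform for a single document, assuming a hypothetical base IRI `http://example.org/account/` and predicate IRIs derived from the element names (adjust these to your own vocabulary):

```xquery
xquery version "1.0-ml";
import module namespace sem = "http://marklogic.com/semantics"
  at "/MarkLogic/semantics.xqy";

declare variable $doc := fn:doc("/accounts/1000-000-000--000.xml");

(: Build a subject IRI from the record ID, then wrap selected facts
   as triples embedded alongside the original XML in the same doc :)
let $subject := sem:iri(fn:concat("http://example.org/account/", $doc/root/ID))
let $triples :=
  for $field in ($doc/root/NAME, $doc/root/CITY, $doc/root/STATE)
  return sem:triple(
           $subject,
           sem:iri(fn:concat("http://example.org/predicate/",
                             fn:local-name($field))),
           fn:string($field))
return xdmp:node-insert-child($doc/root,
  <sem:triples>{ $triples }</sem:triples>)
```

With the triple index enabled, the embedded `sem:triples` element makes these facts visible to SPARQL while the document remains searchable as ordinary XML.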

You could do this using CPF: http://docs.marklogic.com/guide/cpf - or with a tool like http://marklogic.github.io/recordloader/ and its XccModuleContentFactory class.

But if you want to get away from the original XML format entirely, you could do that: translate your XML into triples and ingest those triples instead of the original XML. You can also have pure XML documents and pure triple documents in the same database.

mblakele
  • What may have been unclear previously is that my source is a _delimited structured file_ that I ingest into ML using MLCP. I want to associate this dataset with RDF and leverage the semantic capabilities of ML 7. I don't know the _best practice_ and right way to achieve this. I am looking into CPF as you said; could you also elaborate on how I can make use of it so that I can get a good start? – Shrey Shivam Nov 20 '13 at 13:38
  • It sounds like you would want something along the lines of the CPF enrichment pipeline or XSLT pipeline, but customized for your use-case. The XSLT primer at http://developer.marklogic.com/blog/the-royal-road-to-auto-applying-xslt might help you get started, but there will be a fair amount of custom coding involved. You might also look at RecordLoader: you might find it more straightforward to work with. – mblakele Nov 20 '13 at 18:37
  • RecordLoader seems similar to MLCP. How can this tool be used for this case in particular? Also, is there a way where I don't have to jump into XSLT transformation? Plus, how can I add a triple to each document using Java; could you explain it with reference to my example? I was looking at the points mentioned by @SBuxton but I am stuck at point 2, where he says to ingest my documents as-is and then add triples to them. Meanwhile, I have ingested the geonames RDF and am looking for a solution to the former. – Shrey Shivam Nov 21 '13 at 10:28
  • You're asking for much more than I can answer in 500 chars. See http://marklogic.github.io/recordloader/ and look for `XccModuleContentFactory`. – mblakele Nov 21 '13 at 17:32

As Michael says, there are many ways you could go with this. That's because MarkLogic 7 is so flexible: you can express information as triples or as XML (or as JSON or ...) and mix and match data models and query languages.

The first thing to figure out is - what are you trying to achieve? If you just want to get your feet wet with MarkLogic's mix of XML and triples, here's what I'd suggest:

  1. Ingest your XML documents as above. If you have something text-heavy, such as a description of the account or a free-text annotation, so much the better.

  2. Using XQuery or XSLT, add a triple to each document that represents the city; e.g. for the sample document you posted, add

    --this document URI-- unique/uri/Location New York

  3. Import triples from the web that map city names to states and zip codes (e.g. from geonames).

  4. Now, with a mixture of SPARQL and XQuery, you can search for e.g. the current balance of every account in some zip code (even though your documents don't contain zip codes).
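
Step 4 might look roughly like this (a sketch; the `ex:` and `geo:` predicate IRIs are hypothetical and depend on how steps 2 and 3 named things):

```xquery
xquery version "1.0-ml";
import module namespace sem = "http://marklogic.com/semantics"
  at "/MarkLogic/semantics.xqy";

(: Join the per-document location triple with imported geonames
   triples, then pull the balance from each matching XML document :)
let $solutions := sem:sparql('
  PREFIX ex:  <http://example.org/predicate/>
  PREFIX geo: <http://example.org/geo/>
  SELECT ?doc
  WHERE {
    ?doc  ex:Location ?city .
    ?city geo:zipCode "10001" .
  }')
for $solution in $solutions
let $uri := fn:string(map:get($solution, "doc"))
return fn:doc($uri)/root/CURRENT_BALANCE
```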

The documentation gives a good description of loading triples from external sources using mlcp.

See http://docs.marklogic.com/guide/semantics/setup

and for more detail on loading triples see http://docs.marklogic.com/guide/semantics/loading
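
For example, bulk-loading an N-Triples file of geonames data might look like this (a sketch; adjust host, credentials, and the file path to your environment):

```shell
# -input_file_type rdf tells mlcp to parse RDF serializations
# (N-Triples, Turtle, RDF/XML, ...) and load them as triples
mlcp.sh import -host localhost -port 8000 \
    -username admin -password admin \
    -input_file_path /data/geonames.nt \
    -input_file_type rdf
```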

Note too that you can now run either XQuery or SPARQL (or SQL) queries directly from Query Console at http://your-host:8000/qconsole/

SBuxton
  • Thanks, that clears up several things. However, as updated in the question, I have a delimited file that I ingest via MLCP. I am wondering how I can implement _point 2_ that you mentioned, where I should add a triple to each document. Do I have to do some **pre-processing** (using custom code, or are there useful open-source transformation tools) on my entire dataset and then _update_ my docs? Reading through the Semantics Guide I figured that `sem:rdf-insert` etc. are update functions, but are they suitable for bulk updates in the billions? Plus, should my RDF DB and document DB be separate, or can a triple index exist in the document DB? – Shrey Shivam Nov 20 '13 at 07:14