Adding extra data to XML documents in MarkLogic

Question

I have a large set of XML documents in MarkLogic that contain a so-called ‘smart number’ (e.g. the first 2 characters represent the department, the next 3 represent the project, etc.). Parsing the required information from the numbers is pretty complex and requires database lookups and such. We have a Java process that handles the parsing. Each document can contain several of those numbers, and I’d like to be able to query the set of XML documents based on attributes of the smart number. For example, how many hours were billed for a given department, or a breakdown of how many hours went to a given project (this data can be spread across many documents). This makes me think that I need to somehow attach the parsed data to the XML document.

I’m new to MarkLogic and I’m wondering what would be considered best practice for this kind of situation. One thing I can think of is to edit each XML file and add the parsed data into the XML:

So this:

<ELEMENT>
    <SMART_NUMBER>Blah, Blah, Blah</SMART_NUMBER>
</ELEMENT>
<ELEMENT>
    <SMART_NUMBER>Blah2, Blah2, Blah2</SMART_NUMBER>
</ELEMENT>

Becomes this:

<ELEMENT>
    <SMART_NUMBER>Blah, Blah, Blah</SMART_NUMBER>
    <PARSED_DATA>
        <DEPARTMENT>BLAH BLAH</DEPARTMENT>
        <PROJECT>BLAH BLAH</PROJECT>
        …
    </PARSED_DATA>
</ELEMENT>
<ELEMENT>
    <SMART_NUMBER>Blah2, Blah2, Blah2</SMART_NUMBER>
    <PARSED_DATA>
        <DEPARTMENT>BLAH2 BLAH2</DEPARTMENT>
        <PROJECT>BLAH2 BLAH2</PROJECT>
        …
    </PARSED_DATA>
</ELEMENT>

I’m not sure if there is a ‘better’ way. Using Semantics seems possible: for each smart number in a document, create a triple that links the document to the smart number. Then, for each smart number, create a set of triples that define the various parts of the smart number. But I’m very unfamiliar with Semantics, so I don’t know if this approach would even be worth pursuing. Any ideas/suggestions would be welcome.

Thanks for the replies, I went ahead and used the approach where I update the XML and it seems to be working fine. — David Harris, Jul 05 '16 at 19:38

score 0 · Answer 1 · answered Jun 29 '16 at 19:01

I think you are on the right track. If you want fast faceted search, then denormalizing the data is by far the simplest approach. But instead of translating the codes to names (which requires complex lookups, if I understood correctly), you could also consider just splitting the smart number into separate identifiers, like department-id and project-id. You can always translate an id to a name later, on the fly.
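To make the splitting idea concrete, here is a minimal sketch in Python using only the standard library. It is not MarkLogic-specific. It assumes the fixed-position layout from the question's example (first 2 characters for the department, next 3 for the project), and the `DOC` wrapper, the `DEPARTMENT_ID` / `PROJECT_ID` element names and the sample value `FI123XYZ` are all made up for illustration. Real smart numbers will likely need your existing Java parsing logic instead.

```python
import xml.etree.ElementTree as ET


def split_smart_number(smart_number):
    """Split a smart number by fixed positions.

    Assumed layout (taken from the question's example, not a real spec):
    characters 0-1 are the department id, characters 2-4 the project id.
    """
    return {
        "DEPARTMENT_ID": smart_number[:2],
        "PROJECT_ID": smart_number[2:5],
    }


def enrich(xml_text):
    """Add a PARSED_DATA child next to every SMART_NUMBER that lacks one."""
    root = ET.fromstring(xml_text)
    # Materialize the list first so the tree is not modified while iterating.
    for element in list(root.iter("ELEMENT")):
        smart = element.find("SMART_NUMBER")
        if smart is None or not smart.text:
            continue
        if element.find("PARSED_DATA") is not None:
            continue  # already enriched, keep the update idempotent
        parsed = ET.SubElement(element, "PARSED_DATA")
        for name, value in split_smart_number(smart.text.strip()).items():
            ET.SubElement(parsed, name).text = value
    return ET.tostring(root, encoding="unicode")


# Hypothetical sample document; the DOC wrapper is only there to get a single root.
sample = "<DOC><ELEMENT><SMART_NUMBER>FI123XYZ</SMART_NUMBER></ELEMENT></DOC>"
enriched = enrich(sample)
print(enriched)
```

Once the ids are separate elements in the document, they can be indexed and used as facets, and the id-to-name lookup can stay outside the stored document.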

Using semantics could be fun, but it is mostly interesting if you want to link to other linked data sources, want to use SPARQL, or would like to infer relations.

HTH!

score 0 · Accepted Answer · edited Jun 20 '20 at 09:12

I also think you are on the right track.

Storing the data as triples is an interesting idea. As you rightly pointed out, you can record the various parts of a smart number against that smart number. The triples might look like this:

<smart-number-1> <predicate\department> <department-1>

<smart-number-1> <predicate\project> <project-1>
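As a rough sketch of how such triples could be generated from already-parsed data, the Python snippet below builds N-Triples lines as plain strings. Everything here is an assumption for illustration: the `http://example.org/` base IRI, the `hasSmartNumber` predicate, the document IRI and the sample values are all hypothetical. It only produces text and does not call any MarkLogic API, so loading the result is left to whatever tool you use.

```python
from urllib.parse import quote

# Hypothetical base IRI; replace with a namespace you control.
BASE = "http://example.org/"


def iri(*segments):
    """Build an absolute IRI in angle brackets, percent-encoding each segment."""
    return "<" + BASE + "/".join(quote(s, safe="") for s in segments) + ">"


def smart_number_triples(doc_iri, smart_number, parts):
    """Return N-Triples lines linking a document to a smart number and its parts.

    doc_iri is expected to be a complete IRI in angle brackets already.
    parts maps a part name (e.g. "department") to its parsed value.
    """
    subject = iri("smart-number", smart_number)
    # One triple linking the document to the smart number it contains.
    lines = [f"{doc_iri} {iri('predicate', 'hasSmartNumber')} {subject} ."]
    # One triple per parsed part of the smart number.
    for name, value in parts.items():
        lines.append(f"{subject} {iri('predicate', name)} {iri(name, value)} .")
    return lines


triples = smart_number_triples(
    iri("doc", "timesheet-1.xml"),  # hypothetical document IRI
    "FI123XYZ",                     # hypothetical smart number
    {"department": "FI", "project": "123"},
)
print("\n".join(triples))
```

The first line links the document to the smart number, and the remaining lines describe the smart number itself, which matches the two-level structure proposed in the question.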

You can also specify a graph name when inserting triples if you want to partition the data by some parameter. If you use graphs, you might need to set graph permissions.

PS: Graphs are to triples what collections are to XML documents.

Hope this helps!
