
I'm looking for a MarkLogic cts query that I can use as a source query in Data Hub to return the distinct values on a combination of json property paths.

For example, I have 200K+ docs with this structure:

{
    "name": "Bentley University",
    "unit": null,
    "type": "University or College",
    "location": {
        "state": {
            "code": "MA",
            "name": "Massachusetts"
        },
        "division": "New England",
        "region": "Northeast",
        "types": [
            "School-College",
            "University"
        ]
    }
}

I would like a cts query that returns the distinct combinations of name + location/state/code values.

I've tried using cts.jsonPropertyScopeQuery(), but that returns entire docs. I just want the distinct values returned.

  • This is the perfect use case for indexes. Are you looking for a solution that would leverage an index, or is that not an easy option? – Mads Hansen Jan 10 '23 at 15:45
  • Thx @MadsHansen. I can definitely index the values and run this query in QC: `cts.valueCoOccurrences(cts.pathReference("/envelope/instance/state_abbr"), cts.pathReference("/envelope/instance/pug_agency_name"))` When I try to use the same query in a Data Hub flow, I get this error: `Extension Error: code: 400 message: Unable to collect items to process; sourceQuery script: cts.uris(null, null, cts.valueCoOccurrences(cts.pathReference('/envelope/instance/state_abbr'), cts.pathReference('/envelope/instance/pug_agency_name'))); error: arg3,cts.query? document: %3` – Abhai Raj Llewellyn Jan 10 '23 at 19:22
  • cts.valueCoOccurrences() will return a Sequence of Arrays with the two values. Are you looking to also report the URIs of the docs those values are in? For that you can use `cts.valueTuples([cts.uriReference(), cts.pathReference("/name"), cts.pathReference("/location/state/code")])` – Mads Hansen Jan 10 '23 at 19:41
  • @MadsHansen thx again for the help. I want to normalize data ingested into Data Hub. The data is hate incidents (csv). Each row represents a single hate incident reported by a publishing agency (usually a city - identified by state code and city name). Any city can report multiple incidents. There are 250K+ incidents reported by 9K+ publishing agencies. When I harmonize, I'd like to extract the publishing agency data into docs and keep only elements needed to ref them in the incident docs. Maybe I can use a `before` hook to create a single doc (using the query you gave) and then process that? – Abhai Raj Llewellyn Jan 10 '23 at 21:52

1 Answer


If you had a range index on those two fields, then you could do this very easily with cts.valueCoOccurrences().

Returns value co-occurrences (that is, pairs of values, both of which appear in the same fragment) from the specified value lexicon(s). The values are returned as an ArrayNode with two children, each child containing one of the co-occurring values. You can use cts.frequency on each item returned to find how many times the pair occurs. Value lexicons are implemented using range indexes; consequently this function requires a range index for each input index reference. If an index or lexicon is not configured for any of the input references, an exception is thrown.

For example, with path range indexes on /name and /location/state/code, the query would look like this:

cts.valueCoOccurrences(cts.pathReference("/name"), cts.pathReference("/location/state/code"))

Without indexes, a brute-force method that reads all of the matching documents would look something like this:

// Accumulate distinct name + state code pairs by reading every doc
// that has both a "name" and a "code" property
const nameAndCode = new Set();
for (const doc of cts.search(cts.andQuery([cts.jsonPropertyScopeQuery("name", cts.trueQuery()),
                                           cts.jsonPropertyScopeQuery("code", cts.trueQuery())]))) {
  const obj = doc.toObject();
  nameAndCode.add(obj.name + "," + obj.location.state.code);
}
Array.from(nameAndCode)

But it might be slow, and you run the risk of hitting an Expanded Tree Cache error if all of the docs can't be read at once.
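The core of that loop is just pair deduplication; here is a minimal, standalone sketch of the same logic in plain JavaScript, where a hypothetical in-memory `docs` array stands in for the cts.search() results (JSON.stringify is used as the set key to avoid collisions if a name ever contains a comma):

```javascript
// Hypothetical stand-in for documents returned by cts.search()
const docs = [
  { name: "Bentley University", location: { state: { code: "MA" } } },
  { name: "Bentley University", location: { state: { code: "MA" } } },
  { name: "Boston College",     location: { state: { code: "MA" } } },
];

// Deduplicate (name, code) pairs using a Set of stringified tuples
const nameAndCode = new Set();
for (const obj of docs) {
  nameAndCode.add(JSON.stringify([obj.name, obj.location.state.code]));
}

// Convert back to arrays of [name, code]
const distinctPairs = Array.from(nameAndCode, (s) => JSON.parse(s));
console.log(distinctPairs.length); // 2 distinct pairs
```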

You could also do some sort of iterative search: sample some documents, add their values to the set, and then use the accumulated values to exclude already-seen docs in the next search, until it no longer returns any documents or hits some limit on the number of searches.
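A rough sketch of that iterative approach, in plain JavaScript: `searchExcluding` is a hypothetical stand-in for cts.search() combined with a not-query on the already-seen values, and for simplicity it excludes on name alone (a real implementation would exclude on whichever indexed values you are collecting):

```javascript
// Hypothetical stand-in for cts.search() with an exclusion query:
// return up to `limit` docs whose name is not in `seenNames`
function searchExcluding(docs, seenNames, limit) {
  return docs.filter((d) => !seenNames.has(d.name)).slice(0, limit);
}

const docs = [
  { name: "Bentley University", location: { state: { code: "MA" } } },
  { name: "Boston College",     location: { state: { code: "MA" } } },
  { name: "Brown University",   location: { state: { code: "RI" } } },
];

const seenNames = new Set();
const pairs = [];
const maxSearches = 10; // safety limit on the number of iterations
for (let i = 0; i < maxSearches; i++) {
  const batch = searchExcluding(docs, seenNames, 2);
  if (batch.length === 0) break; // no unseen docs left
  for (const d of batch) {
    seenNames.add(d.name);
    pairs.push([d.name, d.location.state.code]);
  }
}
console.log(pairs.length); // 3
```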

Mads Hansen