I'm aiming to store and index JSON key/value pairs. Ideally I would store them in a constant field name (for simplicity's sake, "GRADES").

An example of the incoming JSON object:

    "Data": [{
        "Key": "DP01",
        "Value": "Excellent"
    }, {
        "Key": "DP02",
        "Value": "Average"
    }, {
        "Key": "DP03",
        "Value": "Negative"
    }]

The JSON object would be serialized and stored as it is, but I would like to index it in a way to enable me to search within that same field by key and value. The main idea is to search multiple values within the same Lucene Field.

Any suggestions on how to structure the indexing? Let's imagine, for example, that I would like to search using the following query:

[GRADES: "key:DP01 UNIQUEIDasDELIMITER value:Excellent"]

How would a custom analyzer/tokenizer achieve this?

EDIT: An attempt to depict my goal more accurately.

Think of this typical relational type of structure (for simplicity's sake).

  • Each document is a website.

  • A website can have multiple images (and other important metadata).

  • Each image has multiple sets of free-form key/value pair properties:

    {
        "Key": "Scenery",
        "Value": "Nature"
    }, {
        "Key": "Style",
        "Value": "Vintage"
    }
    
  • Another set:

    {
        "Key": "Scenery",
        "Value": "Industrial"
    }, {
        "Key": "Style",
        "Value": "Vintage"
    }
    

My challenge is to take a similar type of structure and index it in a way that enables me to build queries such as:

A website with an image of scenery:industrial and style:vintage.

I'm probably taking the wrong approach as indicated by Andy Pook. Any ideas how to efficiently flatten out these properties?

pelican_george
  • Per the request in http://stackoverflow.com/questions/22465256/indexing-json-object-arrays-in-lucene-net/23513353?noredirect=1#comment58247952_23513353 : You seem to aim at a different way of indexing this data, so it is not quite the same. As per that question, I index key and value in their own fields, "Data.Key" and "Data.Value", which allows searches for "Data.Key: DP01 AND Data.Value: Average" or just one of them. The problem is that this would yield the document as a result in this case, which I assume you don't want; this was a limitation I accepted in my case. – Jens Feb 08 '16 at 13:47

2 Answers

1

How about storing the JSON Data in a multi-valued field, e.g. GRADES, like this:

GRADES: "Key DP01 Value Excellent"
GRADES: "Key DP02 Value Average"
GRADES: "Key DP03 Value Negative"

You could then run a query like this:

GRADES: ("Key DP01" AND "Value Excellent")
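
As a rough sketch of that flattening step (plain Java with no Lucene dependency; the class and method names `GradeFlattener` and `toFieldValues` are made up for illustration), each key/value pair becomes one string that would then be added as its own instance of the GRADES field:

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class GradeFlattener {
    // Turns each key/value pair into one "Key <k> Value <v>" string.
    // Each string is meant to be added as a separate GRADES field instance.
    static List<String> toFieldValues(Map<String, String> data) {
        List<String> values = new ArrayList<>();
        for (Map.Entry<String, String> e : data.entrySet()) {
            values.add("Key " + e.getKey() + " Value " + e.getValue());
        }
        return values;
    }

    public static void main(String[] args) {
        // Sample data taken from the question.
        Map<String, String> data = new LinkedHashMap<>();
        data.put("DP01", "Excellent");
        data.put("DP02", "Average");
        data.put("DP03", "Negative");
        for (String v : toFieldValues(data)) {
            System.out.println("GRADES: \"" + v + "\"");
        }
    }
}
```

Adding the resulting strings to a Lucene document is left out here, since the exact calls depend on the Lucene version in use.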

cris almodovar
  • My goal is to index multiple "Data" json objects and be able to search within, for specific keyvalue pairs. If I flatten out all data as you suggest I won't be able to distinguish them apart from each "List". – pelican_george Mar 21 '16 at 13:23
1

A common "problem" is to think about indexes and documents as having a consistent set of fields. It is not the same as a relational database with tables of a fixed set of columns.

In a previous life I had an entity with a set of "attributes": a key/value collection, much like your grades.

Each document was created with a field named for each attribute, i.e. "attr-thing", with the value added as "NOT_ANALYZED".

So, in your example I'd create fields like

new Field("grade-"+gradeID, grade, Field.Store.NO, Field.Index.NOT_ANALYZED)

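Then you can search with a query like "grade-DP01:excellent". Because the value is NOT_ANALYZED it is matched exactly, including case, so index and query it in the same case.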

Alternatively you can just have a fixed field name (similar to @cris-almodovar) and set the value to something like "id=grade", again NOT_ANALYZED. Then search for "grade:DP01=excellent".

Either will work. I've used both approaches with success but typically prefer the first.
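
To make the two layouts concrete, here is a minimal sketch in plain Java (no Lucene calls; `AttributeFields`, `perAttributeFields`, `fixedFieldValues` and `normalize` are hypothetical names). It only computes the field names and values that would be handed to the indexer. Lower-casing the values is my assumption, made so that an exact, un-analyzed match on "excellent" would succeed:

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

public class AttributeFields {
    // Un-analyzed values match exactly, so this sketch lower-cases them (an assumption).
    static String normalize(String s) {
        return s.toLowerCase(Locale.ROOT);
    }

    // Approach 1: one field per attribute, e.g. "grade-DP01" -> "excellent".
    static Map<String, String> perAttributeFields(Map<String, String> grades) {
        Map<String, String> fields = new LinkedHashMap<>();
        for (Map.Entry<String, String> e : grades.entrySet()) {
            fields.put("grade-" + e.getKey(), normalize(e.getValue()));
        }
        return fields;
    }

    // Approach 2: a fixed field name with values like "DP01=excellent".
    static List<String> fixedFieldValues(Map<String, String> grades) {
        List<String> values = new ArrayList<>();
        for (Map.Entry<String, String> e : grades.entrySet()) {
            values.add(e.getKey() + "=" + normalize(e.getValue()));
        }
        return values;
    }

    public static void main(String[] args) {
        Map<String, String> grades = new LinkedHashMap<>();
        grades.put("DP01", "Excellent");
        grades.put("DP02", "Average");
        System.out.println(perAttributeFields(grades));
        System.out.println(fixedFieldValues(grades));
    }
}
```

The first layout gives every attribute its own field name, while the second keeps a single field name and moves the key into the term itself.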

Addition in response to the edit...

I think I understand the problem... If you had "scenery=industrial style=vintage" and "scenery=nature style=modern" you wouldn't want it to match if you searched "nature AND vintage", right?

You could add an "imageType" field for each set with a value like "scenery=industrial style=vintage abc=xyz", indexed with the WhitespaceAnalyzer (which just splits on whitespace; the KeywordAnalyzer would keep the whole value as a single term).

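Then search with imageType:"scenery=industrial style=vintage"~2. Using a slop phrase requires the values to be close to each other, which is what ties them to the same field instance, and the slop allows for the order to be different or for there to be extra values. One caveat: Lucene numbers term positions continuously across field instances of the same name, and the analyzer's position increment gap is 0 by default, so that gap should be set larger than the slop to keep a phrase from matching across two sets. The slop number you'd have to figure out based on the number of properties you expect in each field. Simplistically, if you expect a maximum of N values then the slop should be N too.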
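
The outcome this is aiming for can be modelled in a few lines of plain Java. This is only a model of the intended semantics, not of how Lucene evaluates a phrase query, and `SameSetMatch` and `anySetContainsAll` are invented names: a document matches only if a single imageType value contains every requested term.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class SameSetMatch {
    // True if at least one field value (one image's property set)
    // contains every wanted term after splitting on whitespace.
    static boolean anySetContainsAll(List<String> fieldValues, String... wanted) {
        for (String value : fieldValues) {
            Set<String> terms = new HashSet<>(Arrays.asList(value.trim().split("\\s+")));
            if (terms.containsAll(Arrays.asList(wanted))) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Two property sets on the same document, one per imageType value.
        List<String> imageTypes = Arrays.asList(
                "scenery=industrial style=vintage",
                "scenery=nature style=modern");
        // Both terms come from the first set.
        System.out.println(anySetContainsAll(imageTypes, "scenery=industrial", "style=vintage"));
        // The terms exist on the document, but in different sets.
        System.out.println(anySetContainsAll(imageTypes, "scenery=nature", "style=vintage"));
    }
}
```

The second lookup mixes terms from two different sets, which is exactly the kind of hit the per-set field is meant to rule out.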

AndyPook
  • thanks for the feedback. Let me use a better example to depict what I'm trying to achieve. Please see the updated example. – pelican_george Mar 21 '16 at 14:42
  • also, I should mention that these documents are often updated/reused whereas the new incoming "entity" which will append data to this document has no knowledge of previously indexed values. Consequently I need to store these "old" fields. – pelican_george Mar 21 '16 at 14:44
  • Let's say that you add "scenery=Nature style=Wild" and "scenery=Industrial style=Grunge". This will create 4 terms (scenery=Nature, style=Wild, scenery=Industrial and style=Grunge). If I search for imageType:"style=Wild scenery=Industrial" won't I get a hit back? – pelican_george Mar 22 '16 at 13:38
  • no, not with the slop phrase type of query. The terms specified must be proximate to each other (within the slop). A simple phrase query implies slop 0. They _must_ be in the same field – AndyPook Mar 22 '16 at 20:34
  • I'm not sure if I mentioned, but the imageType is a multi valued field. So I have multiple "sets" of key value pairs within the same document. – pelican_george Mar 23 '16 at 07:31
  • yup. Lucene allows a doc to have many fields of the same name. So you add multiple fields, one for each set – AndyPook Mar 23 '16 at 10:45
  • so those 4 terms will have a direct "relationship" to the very same field "instance"? Imagine I have two imageType fields, the 1st with terms "scenery=A" and "style=Wild", the 2nd with terms "scenery=B" and "style=Urban". If I search for imageType:"scenery=B style=Wild"~2 I should get 0 hits back, right? – pelican_george Mar 23 '16 at 12:46
  • Correct, zero hits. A Lucene Document is just a collection of fields. It doesn't matter if more than one field has the same name. A search just looks for a match in a named field. So you would have many "instances" of fields (with the same name) added to the Document – AndyPook Mar 23 '16 at 12:56