I need to index a patent catalog that has the following data structure:

  "cpc": [
    {
      "class": "61",
      "section": "A",
      "sequence": "1",
      "subclass": "K",
      "subgroup": "06",
      "main-group": "45",
      "classification-value": "I"
    },
    {
      "class": "61",
      "section": "A",
      "sequence": "2",
      "subclass": "K",
      "subgroup": "506",
      "main-group": "31",
      "classification-value": "I"
    }
  ]

I am wondering what the right approach is here. I might be able to use field names such as cpc.class and combine them with multiValued="true".

I would like to find documents that match a certain CPC code. The CPC code can be partial. Right now my solution is simply to use a nested reference with multiValued="true". Is there a better way of doing this?

<field name="cpc.class"                 type="int"    indexed="true" stored="true" multiValued="true" />
<field name="cpc.section"               type="string" indexed="true" stored="true" multiValued="true" />
<field name="cpc.sequence"              type="int"    indexed="true" stored="true" multiValued="true" />
<field name="cpc.subclass"              type="string" indexed="true" stored="true" multiValued="true" />
<field name="cpc.subgroup"              type="int"    indexed="true" stored="true" multiValued="true" />
<field name="cpc.main-group"            type="int"    indexed="true" stored="true" multiValued="true" />
<field name="cpc.classification-value"  type="string" indexed="true" stored="true" multiValued="true" />

The problem with this implementation is that it returns documents that do not actually match the search criteria. Example:

"cpc.section:A",
"cpc.class:61",
"cpc.subclass:Q",
"cpc.main-group:8"

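I get documents that do not contain this combination. I think that with the current schema every field is an independent list of values, so a document is returned whenever each value matches somewhere, in any combination across its CPC entries. I need to narrow this down so that a document is returned only when a single CPC entry contains all of the requested values.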
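The cross-matching described above can be reproduced outside Solr. The following is a minimal Python sketch, not Solr code: it assumes that each multiValued field behaves like an independent collection of values per document, as suspected above. The function names and the sample query are made up for illustration; the data is the two CPC entries from the question.

```python
# Sample document: the two CPC entries shown in the question.
cpc_entries = [
    {"section": "A", "class": "61", "subclass": "K", "main-group": "45", "subgroup": "06"},
    {"section": "A", "class": "61", "subclass": "K", "main-group": "31", "subgroup": "506"},
]


def matches_flattened(entries, query):
    """Assumed multiValued behaviour: each field is checked on its own,
    against the values collected from all entries of the document."""
    return all(
        any(entry.get(field) == value for entry in entries)
        for field, value in query.items()
    )


def matches_per_entry(entries, query):
    """Desired behaviour: one single entry must satisfy every condition."""
    return any(
        all(entry.get(field) == value for field, value in query.items())
        for entry in entries
    )


# main-group 45 comes from the first entry, subgroup 506 from the second,
# so no single CPC entry contains this combination.
mixed_query = {"section": "A", "main-group": "45", "subgroup": "506"}

print(matches_flattened(cpc_entries, mixed_query))  # True (false positive)
print(matches_per_entry(cpc_entries, mixed_query))  # False
```

Under that assumption, the document matches the mixed query even though neither of its CPC entries does, which is the kind of false positive I am seeing.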

Istvan
  • What do you want to FIND? Structure your Solr index around finding, not around original data structure. – Alexandre Rafalovitch Nov 16 '15 at 02:25
  • The CPC is a hierarchical code, isn't it? Should you model a patent-index, consider [hierarchical facets](https://www.google.de/search?q=solr+hierarchical+facets). – cheffe Nov 16 '15 at 06:18
  • Possible duplicate of [Solr documents with child elements?](http://stackoverflow.com/questions/5584857/solr-documents-with-child-elements) – Alexander Kuznetsov Nov 16 '15 at 12:57
  • I would like to search the documents so that a result is returned only if all of the search terms are present inside a single hash. The multiValued approach yields results that do not actually match the search criteria. See the update in the question. – Istvan Nov 19 '15 at 15:50
  • Are you using Solr or Lucene? Which version? If Solr, how are you accessing it: SolrJ, Solr.NET, ... ? – cheffe Nov 24 '15 at 11:02
  • Solr 4.7, accessed through SolrJ I believe; it is the version shipped with Riak. – Istvan Nov 24 '15 at 17:50

1 Answer

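The best way to index this with Solr is to split the nested data structures (the CPC entries) out into flat documents, one per CPC entry, and to include the patent_id in each of them. Every field is then single-valued, so an arbitrary combination of partial CPC components can be searched without values from different entries being mixed.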
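A minimal sketch of that flattening step in Python, assuming each patent is available as a dict holding the cpc list from the question. The helper names, the id scheme, the flat field names (cpc_class and so on) and the patent id "EXAMPLE-1" are hypothetical; only patent_id comes from the approach described above. The small find_patent_ids helper merely imitates an exact-match query in memory and is not a Solr call.

```python
def flatten_cpc(patent_id, cpc_entries):
    """Turn the nested CPC list of one patent into flat, single-valued documents."""
    docs = []
    for entry in cpc_entries:
        doc = {
            # Hypothetical unique key: patent id plus the CPC sequence number.
            "id": f"{patent_id}_{entry['sequence']}",
            "patent_id": patent_id,
        }
        for key, value in entry.items():
            # Hypothetical field naming: "main-group" becomes "cpc_main_group".
            doc["cpc_" + key.replace("-", "_")] = value
        docs.append(doc)
    return docs


def find_patent_ids(docs, **criteria):
    """In-memory stand-in for a query: every criterion must hold on one flat document."""
    return sorted(
        {d["patent_id"] for d in docs if all(d.get(k) == v for k, v in criteria.items())}
    )


cpc = [
    {"class": "61", "section": "A", "sequence": "1", "subclass": "K",
     "subgroup": "06", "main-group": "45", "classification-value": "I"},
    {"class": "61", "section": "A", "sequence": "2", "subclass": "K",
     "subgroup": "506", "main-group": "31", "classification-value": "I"},
]

docs = flatten_cpc("EXAMPLE-1", cpc)  # "EXAMPLE-1" is a placeholder patent id

# Partial code (section and class only) matches the patent.
print(find_patent_ids(docs, cpc_section="A", cpc_class="61"))
# Values taken from two different CPC entries no longer match.
print(find_patent_ids(docs, cpc_main_group="45", cpc_subgroup="506"))
```

Several flat documents can share one patent_id, so the results may need to be de-duplicated by patent_id before they are presented as patents.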

Istvan