In Solr, we have the following use case:

We have a collection of keywords. Each document has a keyword (the actual phrase that we query by) and a match_type (which can be negative, exact, or broad). Currently we have two field types with different sets of query and index filters: one for exact and negative (they are the same) and one for broad. The documents look like this:

{
kw_exact: "pink dress",
match_type: "exact",
adset_id: 1
},
{
kw_broad: "pink dress",
match_type: "broad",
adset_id: 1
},
{
kw_negative: "red dress",
match_type: "negative",
adset_id: 1
}

What we want is to get the keyword with the highest score per adset and, if a negative keyword wins, exclude that adset from the results.

`/select
?group.field=adset_id
&group.limit=1
&sort=score desc
&group=true
&defType=edismax
&qf=kw_exact_test_bool2^6 kw_broad_test_bool2 kw_negative_test_bool2^7
&rows=200
&fl=adset_id,kw_broad,kw_exact,kw_negative,match_type
&q=dress
&fq=NOT match_type:2`

This strategy does not work because fq is applied before the grouping, so if a negative keyword has the highest score inside an adset, we would not know. Taking the example above: if the user searched for "red dress", the negative and broad keywords would both match, with the negative one scoring higher. The query above would return the following in the results:

{
    "groupValue": null,
    "doclist": {
        "numFound": 2,
        "start": 0,
        "maxScore": 18.0,
        "numFoundExact": true,
        "docs": [
            {
                "kw_broad": "pink dress",
                "adset_id": 1,
                "match_type": "broad"
            }
        ]
    }
}

whereas we want no results for adset_id: 1 in this case.
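To make the desired behaviour concrete, here is a minimal Python sketch of the selection logic applied to already-scored matches, outside of Solr. The function name, the `score` and `keyword` keys, the score values other than 18.0, and the second adset are all made up for illustration; this is only a statement of what we want, not a Solr solution.

```python
from typing import Dict, List


def best_keyword_per_adset(scored_docs: List[dict]) -> Dict[int, dict]:
    """Keep the top-scoring doc per adset_id, then drop adsets whose winner is negative."""
    winners: Dict[int, dict] = {}
    for doc in scored_docs:
        current = winners.get(doc["adset_id"])
        if current is None or doc["score"] > current["score"]:
            winners[doc["adset_id"]] = doc
    # The negative check happens AFTER picking the winner, unlike fq,
    # which removes negative docs before grouping ever sees them.
    return {
        adset: doc
        for adset, doc in winners.items()
        if doc["match_type"] != "negative"
    }


# Hypothetical matches for the search "red dress" (scores are illustrative).
matches = [
    {"adset_id": 1, "keyword": "pink dress", "match_type": "broad", "score": 1.0},
    {"adset_id": 1, "keyword": "red dress", "match_type": "negative", "score": 18.0},
    {"adset_id": 2, "keyword": "red dress", "match_type": "exact", "score": 12.0},
]
result = best_keyword_per_adset(matches)
```

With this input, adset 1 disappears from `result` because its best match is the negative keyword, while the hypothetical adset 2 keeps its exact match.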

We also experimented with nested documents, but the block join query parsers seem to be quite slow at times, and we read that Solr does not truly support nested documents: they are still stored as separate documents. We also couldn't come up with a query that would return the results we want.

The nested document schema would look like this:

{
adset_id: 1,
keywords: [
{
kw_exact: "pink dress",
match_type: "exact",
adset_id: 1
},
{
kw_broad: "pink dress",
match_type: "broad",
adset_id: 1
},
{
kw_negative: "red dress",
match_type: "negative",
adset_id: 1
}]
}

We would be open to a solution for either schema. Any ideas?

Andreea

1 Answer

We got around it by combining several query parsers. The final query looks like this:

select?fl=kw_exact,kw_broad,kw_negative&cache=false&fq={!collapse field=adset_id max=cscore()}&q=-{!join from=adset_id to=adset_id}{!df=kw_negative v=$qq} AND {!dismax qf=kw_broad qf=kw_exact^6 v=$qq}&qq=mySearchPhrase
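Because the query above contains braces, spaces, `^` and `$`, it needs URL encoding when sent from client code. Here is a minimal Python sketch that builds the same parameters; the helper name, host, port, and collection name (`keywords`) are assumptions for illustration, while the parameter values are copied from the query above.

```python
from urllib.parse import urlencode


def build_select_params(search_phrase: str) -> dict:
    """Return the request parameters from the query above, with qq as the user phrase."""
    return {
        "fl": "kw_exact,kw_broad,kw_negative",
        "cache": "false",
        # Keep only the best-scoring document per adset.
        "fq": "{!collapse field=adset_id max=cscore()}",
        # Exclude adsets that have a matching negative keyword,
        # then score the remaining broad/exact keywords.
        "q": (
            "-{!join from=adset_id to=adset_id}{!df=kw_negative v=$qq} "
            "AND {!dismax qf=kw_broad qf=kw_exact^6 v=$qq}"
        ),
        "qq": search_phrase,
    }


query_string = urlencode(build_select_params("red dress"))
# Hypothetical host and collection name; adjust to your deployment.
url = "http://localhost:8983/solr/keywords/select?" + query_string
```

Passing the search phrase once through `qq` and dereferencing it with `v=$qq` keeps the negative and positive clauses in sync.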

There is a limitation, though: since we have shingles on kw_exact and kw_negative, we hit the maxBooleanClauses error pretty quickly.

Andreea