I'm implementing a Solr MoreLikeThis handler to find similar customers.

I have two customers with different names who live at the same address. I want to pass an entity_id to Solr and get back all clients with similar names or addresses. The client will then be able to link both customers together with the click of a button.

I'm using the SolariumBundle to do this in code, but getting it to work with the raw query first should be enough. Once that works, I can adapt it to Solarium myself.

This is my solrconfig.xml:

<?xml version="1.0" encoding="UTF-8" ?>
<config>
  <luceneMatchVersion>LUCENE_36</luceneMatchVersion>
  <directoryFactory name="DirectoryFactory" class="${solr.directoryFactory:solr.StandardDirectoryFactory}"/>

  <updateHandler class="solr.DirectUpdateHandler2" />

  <requestDispatcher handleSelect="true" >
    <requestParsers enableRemoteStreaming="false" multipartUploadLimitInKB="2048" />
  </requestDispatcher>

  <!-- request handlers -->
  <requestHandler name="standard" class="solr.StandardRequestHandler" default="true" />
  <requestHandler name="/update" class="solr.XmlUpdateRequestHandler" />
  <requestHandler name="/mlt" class="solr.MoreLikeThisHandler">
    <lst name="defaults">
      <int name="mlt.mintf">2</int>
      <int name="mlt.mindf">1</int>
      <int name="mlt.minwl">5</int>
      <int name="mlt.maxwl">1000</int>
      <int name="mlt.maxqt">50</int>
      <int name="mlt.maxntp">50000</int>
      <bool name="mlt.boost">true</bool>
      <str name="mlt.fl">customer_data,entity_data,street</str>
      <bool name="mlt.match.include">false</bool>
    </lst>
  </requestHandler>

  <requestHandler name="/admin/" class="org.apache.solr.handler.admin.AdminHandlers" />

  <!-- config for the admin interface -->
  <admin>
    <defaultQuery>solr</defaultQuery>
  </admin>
</config>

The relevant part of my schema.xml is:

<fields>
    <!-- general -->
    <field name="id" type="string" indexed="true" stored="true" multiValued="false" required="true" />
    <field name="type" type="string" indexed="true" stored="true" multiValued="false" required="true"/>
    <field name="entity_id" type="string" indexed="true" stored="true" multiValued="false" required="true"/>
    <field name="sort_id" type="int" indexed="true" stored="true" multiValued="false"/>

    <field name="external_id" type="string" indexed="true" stored="true" multiValued="false"/>
    <field name="status" type="text" indexed="true" stored="true" multiValued="false"/>
    <field name="language" type="string" indexed="true" stored="true" multiValued="false"/>
    <field name="created" type="int" indexed="true" stored="true" multiValued="false"/>

    <field name="name" type="text" indexed="true" stored="true" multiValued="false"/>
    <field name="email" type="string" indexed="true" stored="true" multiValued="false"/>
    <field name="city" type="string" indexed="true" stored="false" multiValued="false"/>
    <field name="country" type="string" indexed="true" stored="false" multiValued="false"/>
    <field name="street" type="string" indexed="true" stored="false" multiValued="false"/>
    <field name="zipcode" type="string" indexed="true" stored="false" multiValued="false"/>

    <field name="entity_data" type="text_ngrm" indexed="true" stored="true" multiValued="true"/>
    <field name="customer_data" type="text_ngrm" indexed="true" stored="true" multiValued="true" termVectors="true" />

    <!-- Entity data filling -->
    <copyField source="entity_id" dest="entity_data"/>
    <copyField source="briljant_id" dest="entity_data"/>
    <copyField source="name" dest="entity_data"/>
    <copyField source="email" dest="entity_data"/>
    <!-- End entity data -->

    <!-- Customer data -->
    <copyField source="name" dest="customer_data"/>
    <copyField source="email" dest="customer_data"/>
    <copyField source="city" dest="customer_data"/>
    <copyField source="country" dest="customer_data"/>
    <copyField source="street" dest="customer_data"/>
    <copyField source="zipcode" dest="customer_data"/>
    <!-- End customer data -->
</fields>

I currently execute this query: http://localhost:8983/solr/core0/mlt?q=entity_id%3A50&wt=json&indent=true&mlt.fl:customer_data and it does return customers that have a similar name. For example, if entity_id:50 (the one I'm querying for) has the name "Foo Bar", it returns customers with the names "Foo Bar", "Bar Foo" and "John Foo". The similarity on street / country / zipcode doesn't work, though.

In the debug:parsedquery I can see different mutations of customer_data:Foo customer_data:Bar customer_data oo Bar, ... but nothing on the address part.

How can I make sure that the query is for: customer_data:Foo customer_data:Bar customer_data:teststreet customer_data:Antwerp?

Botchcake
  • I'm guessing you meant to call it with `&mlt.fl=customer_data` and not `:`? – MatsLindh Aug 22 '14 at 12:56
  • @MatsLindh, yes the actual query is: ``/mlt?q=entity_id%3A50&wt=json&indent=true&mlt.fl=customer_data``, this doesn't give a different result though. – Botchcake Aug 22 '14 at 13:09
  • I'm guessing it's because the name entries are already giving you enough matches. Use just the street field instead of customer_data, and you should see matches on street. – MatsLindh Aug 22 '14 at 13:12
  • @MatsLindh, if I use the query string ``mlt?q=entity_id%3A50&wt=json&indent=true&mlt.fl=street`` I get 0 results: ``"response":{"numFound":0,"start":0,"docs":[]`` – Botchcake Aug 22 '14 at 13:42
  • Well, the field isn't stored. That might affect it. – MatsLindh Aug 22 '14 at 13:48
  • I've stored the field and cleared and reindexed all customers before testing this again. – Botchcake Aug 22 '14 at 13:56

1 Answer

Fields that are defined as type string are not tokenized: the whole value is indexed as a single term, so MLT can only match on the complete value and will find fewer similar documents.

Change the affected fields to a type whose class is solr.TextField, reindex, and it should work.

E.g.:

<!-- type definition -->
<fieldType name="text" class="solr.TextField" positionIncrementGap="100">
  <analyzer> 
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.ASCIIFoldingFilterFactory" preserveOriginal="false"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
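
As a rough sketch of how that type might then be applied, the address fields from the question's schema could be declared as below. This is an illustration under assumptions, not a tested configuration: it reuses the field names from the question, assumes the type is named "text" as in the definition above, and sets stored="true" only because the comment thread suggests MLT may need the stored value of the source document.

```xml
<!-- Sketch only: address fields from the question, switched from "string" to "text" -->
<!-- stored="true" is an assumption based on the comment thread, not a confirmed requirement -->
<field name="city" type="text" indexed="true" stored="true" multiValued="false"/>
<field name="country" type="text" indexed="true" stored="true" multiValued="false"/>
<field name="street" type="text" indexed="true" stored="true" multiValued="false"/>
<field name="zipcode" type="text" indexed="true" stored="true" multiValued="false"/>
```

Existing documents keep their old single-term values until they are indexed again, so a full reindex would be needed before retrying the ``mlt.fl=street`` query.
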
Howie