Solr indexing of MongoDB collection

Question

Suppose I have a test application representing some friends list. The application uses a collection where all documents are in the following format:

_id : ObjectId("someString"),
name : "George",
description : "some text",
age : 35,
friends : [
    {
     name: "Peter",
     age: 30,
     town: {
              name_town: "Paris",
              country: "France"
           }
    },
    {
     name: "Thomas",
     age: 25,
     town: {
              name_town: "Berlin",
              country: "Germany"
           }
    }, ...                // more friends
]
...                          // more documents

How can I describe such a collection in schema.xml? I need to run facet queries like: "Give me countries, where George's friends live". Another use case might be: "Return all documents(persons), whose friend is 30 years old.", and so on.

My initial idea is to mark the "friends" attribute as a text field with this schema.xml definition:

<fieldType name="text_wslc" class="solr.TextField" positionIncrementGap="100">
....
<field name="friends" type="text_wslc" indexed="true" stored="true" />

and then try to search for, e.g., the words "age" and "30" in the text, but that is not a very reliable solution.

Please set aside the fact that the architecture of the collection is not logically well-formed. It is only an example of a similar problem I am currently facing.

Any help or idea will be highly appreciated.

EDIT: Sample 'schema.xml'

<?xml version="1.0" encoding="UTF-8" ?>
<schema name="text-schema" version="1.5">
    <types>
        <fieldType name="string" class="solr.StrField" sortMissingLast="true" omitNorms="true"/>
        <fieldType name="long" class="solr.TrieLongField" precisionStep="0" omitNorms="true" positionIncrementGap="0" />
        <fieldType name="trInt" class="solr.TrieIntField" precisionStep="0" omitNorms="true" />
        <fieldType name="text_p" class="solr.TextField" positionIncrementGap="100">
            <analyzer type="index">
                <tokenizer class="solr.WhitespaceTokenizerFactory"/>
                <filter class="solr.TrimFilterFactory"/>
                <filter class="solr.WordDelimiterFilterFactory"/>
                <filter class="solr.LowerCaseFilterFactory"/>
            </analyzer>
            <analyzer type="query">
                <tokenizer class="solr.WhitespaceTokenizerFactory"/>
                <filter class="solr.TrimFilterFactory"/>
                <filter class="solr.WordDelimiterFilterFactory"/>
                <filter class="solr.LowerCaseFilterFactory"/>
            </analyzer>
        </fieldType>
    </types>

    <fields>
            <field name="_id" type="string" indexed="true" stored="true" required="true" />
            <field name="_version_" type="long" indexed="true" stored="true"/>
            <field name="_ts" type="long" indexed="true" stored="true"/>
            <field name="ns" type="string" indexed="true" stored="true"/>               
            <field name="description" type="text_p" indexed="true" stored="true" />
            <field name="name" type="text_p" indexed="true" stored="true" />
            <field name="age" type="trInt" indexed="true" stored="true" />  
            <field name="friends" type="text_p" indexed="true" stored="true" />         <!-- Here is the problem - when the type is text_p, all fields are considered as a text; optimal solution would be something like "collection" tag to mark name_town and town as descendant of the field 'friends' but unfortunately, this is not how the solr works-->

            <field name="town" type="text_p" indexed="true" stored="true"/> 
            <field name="name_town" type="string" indexed="true" stored="true"/>    
            <field name="town" type="string" indexed="true" stored="true"/> 
    </fields>

    <uniqueKey>_id</uniqueKey>
</schema>

Well, if you want to stick to your schema idea, I do not see a solution for your requirement. You will need the join feature, as you want to do something like nested entities. There is no other reliable way to query for something like this without running into an update-hell. — cheffe, Oct 04 '13 at 14:09

cheffe · Answer 1 · 2013-10-04T12:19:54.060

Since Solr is document-centric, you will need to flatten your data as much as you can. Based on the sample you have given, I would create a schema.xml like the one below.

<?xml version="1.0" encoding="UTF-8" ?>
<schema name="friends" version="1.0">

    <fields>
        <field name="id" 
            type="int" indexed="true" stored="true" multiValued="false" />
        <field name="name" 
            type="text" indexed="true" stored="true" multiValued="false" />
        <field name="description" 
            type="text" indexed="true" stored="true" multiValued="false" />
        <field name="age" 
            type="int" indexed="true" stored="true" multiValued="false" />
        <field name="town" 
            type="text" indexed="true" stored="true" multiValued="false" />
        <field name="townRaw" 
            type="string" indexed="true" stored="true" multiValued="false" />
        <field name="country" 
            type="text" indexed="true" stored="true" multiValued="false" />
        <field name="countryRaw" 
            type="string" indexed="true" stored="true" multiValued="false" />
        <field name="friends" 
            type="int" indexed="true" stored="true" multiValued="true" />
    </fields>
    <copyField source="country" dest="countryRaw" />
    <copyField source="town" dest="townRaw" />

    <types>
        <fieldType name="string" class="solr.StrField" sortMissingLast="true"/>
        <fieldType name="int" class="solr.TrieIntField" 
            precisionStep="0" positionIncrementGap="0" />
        <fieldType name="text" class="solr.TextField" 
            positionIncrementGap="100">
            <analyzer>
                <tokenizer class="solr.StandardTokenizerFactory" />
                <filter class="solr.LowerCaseFilterFactory" />
            </analyzer>
        </fieldType>
    </types>
</schema>

I would model each person as a separate document. The relationship between two persons is modelled via the attribute friends, which translates into an array of IDs. So at index time you need to fetch the IDs of all friends of a person and put them into that field.
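The index-time flattening described above could be sketched roughly as follows. This is only a sketch under stated assumptions, not the code from the linked project: the function name flatten_people is hypothetical, persons are identified by name (assumed unique), integer IDs are handed out in order of first appearance, friends is assumed to be a plain list, and friendship is stored in both directions so that a query on the friends field finds a person's friends.

```python
from itertools import count


def flatten_people(mongo_docs):
    """Turn nested person documents into flat, Solr-style documents.

    Assumption: names are unique and serve as the key for assigning IDs.
    A real indexer would use stable keys from the source collection.
    """
    next_id = count(1)
    ids = {}   # name -> integer ID
    flat = {}  # name -> flat document

    def person_id(name):
        if name not in ids:
            ids[name] = next(next_id)
        return ids[name]

    def upsert(name, **fields):
        doc = flat.setdefault(
            name, {"id": person_id(name), "name": name, "friends": []}
        )
        # Only copy fields that are actually present in the source.
        doc.update({k: v for k, v in fields.items() if v is not None})
        return doc

    for person in mongo_docs:
        owner = upsert(person["name"], age=person.get("age"),
                       description=person.get("description"))
        for friend in person.get("friends", []):
            town = friend.get("town", {})
            friend_doc = upsert(friend["name"], age=friend.get("age"),
                                town=town.get("name_town"),
                                country=town.get("country"))
            # Assumption: friendship is symmetric, so store both directions.
            if friend_doc["id"] not in owner["friends"]:
                owner["friends"].append(friend_doc["id"])
            if owner["id"] not in friend_doc["friends"]:
                friend_doc["friends"].append(owner["id"])

    return sorted(flat.values(), key=lambda d: d["id"])
```

Each returned dictionary corresponds to one document in the schema above; the Raw fields are left out because the copyField directives fill them.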

Most of the other fields are straightforward. The two Raw fields are the interesting ones. Since you said that you want to facet on the country, you will need the country value unchanged or optimized for faceting. Usually the types of fields differ depending on their purpose (searching for them, faceting by them, autosuggesting them, etc.). In this case countryRaw and townRaw are string copies of country and town, indexed exactly as the values are given.
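To see why the unanalyzed copies matter for faceting, here is a rough illustration in plain Python. It does not call Solr; the regular expression only approximates what the "text" type above does (split into words, lower-case), the helper name facet_counts is hypothetical, and "United Kingdom" is a made-up sample value that does not appear in the question.

```python
import re
from collections import Counter


def facet_counts(values, analyzed):
    """Count facet buckets for one field across documents.

    analyzed=True roughly imitates the "text" type (words, lower-cased);
    analyzed=False imitates the "string" type (value kept verbatim).
    """
    counts = Counter()
    for value in values:
        terms = re.findall(r"\w+", value.lower()) if analyzed else [value]
        # A document counts once per distinct term it contains.
        counts.update(set(terms))
    return counts


countries = ["France", "United Kingdom", "United Kingdom"]
on_text_field = facet_counts(countries, analyzed=True)
on_raw_field = facet_counts(countries, analyzed=False)
```

Faceting on the analyzed field yields one bucket per token, such as "united" and "kingdom", while the raw field yields one bucket per original value.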

Now to your use cases,

Give me countries, where George's friends live

This can be done by faceting. You could

query for the ID of George in the friends field
facet on countryRaw

Such a query would look like this, assuming George has the ID 1: q=friends:1&rows=0&facet=true&facet.field=countryRaw&facet.mincount=1
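If you build that request from code, the parameters have to be URL-encoded. A minimal sketch using only the Python standard library follows; the host, port, and core name in SOLR_SELECT_URL are assumptions, as is the helper name country_facet_url.

```python
from urllib.parse import urlencode

# Hypothetical core URL; adjust host, port, and core name to your setup.
SOLR_SELECT_URL = "http://localhost:8983/solr/friends/select"


def country_facet_url(person_id):
    """Build the facet request for the countries of a person's friends."""
    params = [
        ("q", f"friends:{person_id}"),  # documents listing this ID as a friend
        ("rows", 0),                    # only the facet counts are needed
        ("facet", "true"),
        ("facet.field", "countryRaw"),
        ("facet.mincount", 1),          # hide countries without matches
    ]
    return f"{SOLR_SELECT_URL}?{urlencode(params)}"


url = country_facet_url(1)
```

The colon in friends:1 is percent-encoded by urlencode, so the resulting URL can be passed to any HTTP client as it is.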

Return all documents(persons), whose friend is 30 years old.

This one is harder. First off you will need Solr's join feature. You need to configure this in your solrconfig.xml.

<config>
    <!-- loads of other stuff -->
    <queryParser name="join" class="org.apache.solr.search.JoinQParserPlugin" />
    <!-- loads of other stuff -->
</config>

The corresponding join query would look like this: q={!join from=id to=friends}age:[30 TO *]

It works as follows:

with age:[30 TO *] you search for all persons aged 30 or older (use age:30 instead if the friend has to be exactly 30)
then you take their id and join it on the friends attribute of all other persons
this returns all persons whose friends attribute contains one of the ids matched by the initial query
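The steps above can be emulated in plain Python to make the join semantics concrete. This is only an in-memory illustration, not how Solr executes the join; the function name and the IDs in the sample data are assumptions, while the names and ages come from the question.

```python
def join_from_id_to_friends(docs, inner_matches):
    """Emulate {!join from=id to=friends} on a list of flat documents.

    inner_matches is a predicate standing in for the inner query,
    for example age:[30 TO *].
    """
    # Step 1: run the inner query and collect the values of the "from" field.
    matched_ids = {doc["id"] for doc in docs if inner_matches(doc)}
    # Step 2: keep every document whose "to" field contains one of them.
    return [doc for doc in docs if matched_ids & set(doc.get("friends", []))]


# Hypothetical IDs; friendship is assumed to be stored in both directions.
people = [
    {"id": 1, "name": "George", "age": 35, "friends": [2, 3]},
    {"id": 2, "name": "Peter", "age": 30, "friends": [1]},
    {"id": 3, "name": "Thomas", "age": 25, "friends": [1]},
]

at_least_30 = join_from_id_to_friends(people, lambda d: d["age"] >= 30)
exactly_30 = join_from_id_to_friends(people, lambda d: d["age"] == 30)
```

With the range predicate all three persons are returned, because George (35) is a friend of Peter and Thomas, and Peter (30) is a friend of George. With the exact match only George is returned.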

I did not write this off the top of my head, so you may want to have a look at my solrsample project on GitHub. I have added a test case there that deals with this question:

https://github.com/chriseverty/solrsample/blob/master/src/main/java/de/cheffe/solrsample/FriendJoinTest.java

Cheffe, Thank you for your precisely answered question. But maybe I didn't really emphasise that the schema shouldn't be altered. Let's say that the schema is stated. Can you find any possible solution how it could be possible to access the specified data? — user1949763, Oct 04 '13 at 12:29
user1949763, In that case I need more of your schema.xml. At best the whole `` element including your `types`. — cheffe, Oct 04 '13 at 12:34
I added the 'schema.xml'definition to the original post. But the definitions are rather vague because I was not able to overcome the limitation... — user1949763, Oct 04 '13 at 12:50
