Solr 5.1.0 and 5.2.1: Creating parent-child docs using DIH

Question

I am trying to import into Solr 5.1.0 and 5.2.1 with a data-config that should produce documents with the following structure:

<parentDoc>
    <someParentStuff/>
    <childDoc>
        <someChildStuff/>
    </childDoc>
</parentDoc>

From what I understand from one of the answers on this question about nested entities in DIH, my versions of Solr should be able to create the above structure with the following data-config.xml:

<dataConfig>
    <dataSource type="JdbcDataSource" driver="com.mysql.jdbc.Driver" 
            url=""
            user=""
            password=""
            batchSize="-1"
    />
    <document name="">
        <entity rootEntity="true" name="parent" pk="parent_id" query="select * from parent">
            <field column="parent_id" name="parent_id" />

            <entity child="true" name="child" query="select * from child where parent_id='${parent.parent_id}'">
                <field column="parent_id" name="parent_id" />
                <field column="item_status" name="item_status" />
            </entity>
       </entity>
    </document>
</dataConfig>

However, when I perform a full-import, I get:

<result name="response" numFound="2" start="0">
  <doc>
    <long name="parent_id">477</long> <!-- This is from the child -->
    <str name="item_status">WS</str>
  </doc>
  <doc>
    <long name="parent_id">477</long> <!-- This is from the parent -->
  </doc>
</result>

which I understand is the denormalized layout you're supposed to get pre-5.1.0; however, I expected:

<result name="response" numFound="1" start="0">
    <doc>
        <long name="parent_id">477</long>
        <doc>
            <long name="parent_id">477</long>
            <str name="item_status">WS</str>
        </doc>
    </doc>
</result>

What do I need to do to get my desired document structure? Am I misunderstanding what nested entities in the DIH are supposed to do?

score 3 · Accepted Answer · answered Apr 06 '16 at 22:28

Unless someone swings by to tell me otherwise, it seems I have really misunderstood the creation of parent-child docs in Solr 5.1.0+. I was expecting to be able to nest documents and have them returned that way, but that's not possible with Solr (at least at this point; the future is a mystery).

Solr uses a flat document model. What this means is that it doesn't model parent-child relationships in the way I wanted in my original question. There is no nesting. Everything is flat and denormalized.

What Solr does is index any number of child documents next to their parent in one contiguous block, children first. For example:

childDoc1 childDoc2 childDoc3 parent

This structure is actually reflected in the documents I was "mistakenly" getting returned from Solr:

<result name="response" numFound="2" start="0">
  <doc>
    <long name="parent_id">477</long> <!-- This is from the child -->
    <str name="item_status">WS</str>
  </doc>
  <doc>
    <long name="parent_id">477</long> <!-- This is from the parent -->
  </doc>
</result>

The nested document support available in the DIH after Solr 5.0 is actually an add-on to, or outright replacement for, the old way people had to index nested documents. It also seems to take care of updating child and parent docs at the same time for you. Very convenient!

So, then, how do you express a parent-child relationship when Solr destroys that nice, nested document model you had planned? You have to get the parent docs and the child docs and manage the relationship in your application. How do you get the parents and children?

The answer is block joins.

Use block joins during query time, and then in your application, process those documents into your desired structure. Let's look at two block join queries because they can look a bit weird at first.

{!parent which='type:parent'}item_id:5918307

This block join query says, "Get me the parent document that has one or more child documents with the item_id of 5918307."

{!child of='type:parent'} (fieldA:TERM^100.0 OR fieldB:term^100.0 OR fieldC:term OR (fieldD:term^20.0)) AND (instock:true^9999.0)

This block join query says, "Get me one or more child documents whose parent documents contain the word 'term' and are in stock."

Do NOT search on child fields when doing !child queries: the query that follows {!child of=...} has to match parent documents only. So, referencing the first example, you would not search on item_id in a !child query, because that would give you a 500 error.

You'll notice the type field in these queries. You do have to add this to your schema and data-config yourself. In the schema, it looks like this:

<!-- use this field to differentiate between parent and child docs -->
<field name="type" type="string" indexed="true" stored="false" />

Then in data-config.xml, just add something like the following to the column list of the parent entity's SQL query:

if ('true' = 'true', 'parent', 'parent') as type

And then do the same for the child, except substitute "child" where you put "parent" before.
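Putting those two type columns together with the data-config from the question, the import configuration might look roughly like the sketch below. This is only an illustration under a few assumptions: the table aliases p and c are hypothetical, the empty url, user and password are left exactly as blank as in the question, and a plain string literal ('parent' as type) stands in for the if(...) expression above, on the assumption that the two behave the same in MySQL.

```xml
<dataConfig>
    <!-- Connection details left blank, as in the question -->
    <dataSource type="JdbcDataSource" driver="com.mysql.jdbc.Driver"
            url=""
            user=""
            password=""
            batchSize="-1"
    />
    <document name="">
        <!-- Parent rows: every row gets a constant type column set to 'parent' -->
        <entity rootEntity="true" name="parent" pk="parent_id"
                query="select p.*, 'parent' as type from parent p">
            <field column="parent_id" name="parent_id" />
            <field column="type" name="type" />

            <!-- Child rows: indexed in the same block as their parent, tagged 'child' -->
            <entity child="true" name="child"
                    query="select c.*, 'child' as type from child c where c.parent_id='${parent.parent_id}'">
                <field column="parent_id" name="parent_id" />
                <field column="item_status" name="item_status" />
                <field column="type" name="type" />
            </entity>
        </entity>
    </document>
</dataConfig>
```

With something like this in place, type:parent should match only the parent documents, which is what the which= and of= filters in the block join queries above rely on.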

So in the end you might end up making two queries, but it doesn't seem like adding the block join parser adds too much to query time. I'm seeing maybe an extra 50 or 100ms per query.

You can also usually avoid nested documents altogether by being smart with your SQL joins. What I've discovered, however, is that because the child data then mingles with the parent data, you end up with one "copy" of each parent, carrying specific child information, for every child in your index. In this situation, you would grab the known parent fields from the first document, along with its child fields. For the rest of the documents, you would just grab the child fields.

Another option, when you just want the parent doc and don't want a whole bunch of other docs being returned, is to use grouping queries; however, I wouldn't recommend it. When I tried it on a query that returned a large number of results, I saw query times go from a 10ms - 250ms range all the way up to the 500ms - 1s range.
