When creating the mappings for an index that can search across multiple books, is it preferable to use nested mappings like the one below, or documents with a parent-child relationship?

book: {
  properties: {
    isbn:     {       //- ISBN of the book
      type: 'string'  //- 9783791535661
    },
    title:    {       //- Title of the book
      type: 'string'  //- Alice in Wonderland
    },
    author:   {       //- Author of the book(maybe should be array)
      type: 'string'  //- Lewis Carroll
    },
    category: {       //- Category of the book(maybe should be array)
      type: 'string'  //- Fantasy
    },
    toc: {            //- Array of the chapters in the book
      type: 'nested',
      properties: {
        html: {           //- HTML Content of a chapter
          type: 'string'  //- <!DOCTYPE html><html>...</html>
        },
        title: {          //- Title of the chapter
          type: 'string'  //- Down the Rabbit Hole 
        },
        fileName: {       //- File name of this chapter
          type: 'string'  //- chapter_1.html
        }, 
        firstPage: {      //- The first page of this chapter
          type: 'integer' //- 3
        }, 
        numberOfPages: {  //- How many pages are in this chapter
          type: 'integer' //- 27
        },
        sections: {       //- An array of all of the sections within a chapter
          type: 'nested',
          properties: {
            html: {           //- The html content of a section
              type: 'string'  //- <section>...</section>
            },
            title: {          //- The title of a section
              type: 'string'  //- section number 2 or something
            },
            figures: {        //- Array of the figures within a section
              type: 'nested',
              properties: {
                html: {           //- HTML content of a figure
                  type: 'string'  //- <figure>...</figure>
                },
                caption: {        //- The name of a figure
                  type: 'string'  //- Figure 1
                },
                id: {             //- Id of a figure
                  type: 'string'  //- figure4
                }
              }
            },
            paragraphs: {     //- Array of the paragraphs within a section
              type: 'nested',
              properties: {   
                html: {           //- HTML content of a paragraph
                  type: 'string'  //- <p>...</p>
                },
                id: {             //- Id of a paragraph
                  type: 'string'  //- paragraph3
                }
              }
            }
          }
        }
      }
    }
  }
}

The HTML of an entire book is approximately 250 kB. I would want to run queries such as:

- the best matching paragraph, including its nearest paragraphs on either side
- the best matching section from a single book including any child sections
- the best figure given it is inside a section with a matching title
- etc

I don't really know the specifics of the queries I would want to perform, but it is important to have a lot of flexibility to be able to try out very weird ones without having to change all of my mappings too much.

Ryan White

2 Answers

If you use the nested type, everything will be contained in the same _source document, which for big books can be quite a mouthful.

Whereas if you use parent/child docs for each chapter and/or section, you might end up with smaller chunks which are more chewable...
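
As a rough illustration of the parent/child option, here is a minimal sketch of what the mappings could look like, assuming Elasticsearch 2.x (the version linked in the comments) and only two levels, book and chapter. The type name `chapter` and the sample query text are assumptions made for this sketch, not something from the question.

```javascript
// Sketch of parent/child mappings (Elasticsearch 2.x style).
// Each chapter is its own document whose parent is a book document.
const mappings = {
  book: {
    properties: {
      isbn:  { type: 'string' },
      title: { type: 'string' }
    }
  },
  chapter: {                      // hypothetical child type
    _parent: { type: 'book' },    // declares book as the parent type
    properties: {
      title: { type: 'string' },
      html:  { type: 'string' }
    }
  }
};

// Find chapters that match some text, restricted to books with a matching title.
const chaptersInMatchingBooks = {
  query: {
    bool: {
      must: [
        { match: { html: 'white rabbit' } },
        {
          has_parent: {
            parent_type: 'book',
            query: { match: { title: 'Alice in Wonderland' } }
          }
        }
      ]
    }
  }
};
```

With this layout a hit is a single chapter document rather than the whole book, at the cost of having to index each child with its parent id.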

As always, it heavily depends on the queries you will want to make, so you should first think about all the use cases you will want to support and then you'll be better armed to figure out which approach is best.

There's another approach which uses neither nested nor parent/child, and which simply involves denormalization. Concretely, you pick the smallest "entity" you want to consider, e.g. a section, and then simply create standalone documents for each section. In those section documents, you'd have fields for the book title, author, chapter title, section title, etc.
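
The denormalization idea can be sketched as a small function that flattens one book into standalone section documents before indexing. The input field names follow the mapping in the question; the output names `bookTitle`, `chapterTitle`, `sectionTitle` and `sectionIndex` are hypothetical, chosen only for this example.

```javascript
// Flatten a nested book object into one standalone document per section.
// Book-level and chapter-level fields are copied onto every section document.
function toSectionDocs(book) {
  const docs = [];
  for (const chapter of book.toc || []) {
    (chapter.sections || []).forEach((section, sectionIndex) => {
      docs.push({
        isbn: book.isbn,
        bookTitle: book.title,
        author: book.author,
        category: book.category,
        chapterTitle: chapter.title,
        sectionTitle: section.title,
        sectionIndex,              // position within the chapter
        html: section.html
      });
    });
  }
  return docs;
}

// Tiny sample input, shaped like the mapping in the question.
const sampleBook = {
  isbn: '9783791535661',
  title: 'Alice in Wonderland',
  author: 'Lewis Carroll',
  category: 'Fantasy',
  toc: [
    {
      title: 'Down the Rabbit Hole',
      sections: [
        { title: 'Section 1', html: '<section>...</section>' },
        { title: 'Section 2', html: '<section>...</section>' }
      ]
    }
  ]
};

const sectionDocs = toSectionDocs(sampleBook);
```

Keeping a position field such as `sectionIndex` on each document is one way to fetch neighbouring sections later with a second lookup.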

You can try each approach in its own index and see how it goes for your use cases.

Val
  • In terms of it being a mouthful for big books, do you mean that you have to return the entire book when matching a query to find a good paragraph, or can you select only some information to return when querying? – Ryan White Feb 02 '16 at 14:30
  • You can always select only a single nested object using [nested `inner_hits`](https://www.elastic.co/guide/en/elasticsearch/reference/2.1/search-request-inner-hits.html#nested-inner-hits), however since you're storing the full HTML for all sections, it means your source document can grow quite big since it will contain the whole book. That might not be advisable depending on how big your books are. – Val Feb 02 '16 at 14:34
  • The approximate book size prior to indexing is 250 kB. This seems very large for a document, but I might want to try queries such as finding the best matching paragraph that has nearby paragraphs or sections which are also a decent match, and many other weird queries. I'll update these details in the question too to make it less vague. Thanks for your help – Ryan White Feb 02 '16 at 14:46
  • Sure thing. You'll probably need to create another, more detailed question, however, so as not to cram too much into this one and to keep the topic lean. – Val Feb 02 '16 at 14:50
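
For reference, the nested `inner_hits` option mentioned in the comments above could look roughly like the following search body. This is a sketch assuming the `toc` nested field from the question's mapping and Elasticsearch 2.x syntax; the search text is made up.

```javascript
// Search body that matches on chapter HTML but returns only small
// top-level fields plus the single best matching chapter as an inner hit.
const searchBody = {
  _source: ['isbn', 'title'],        // skip the rest of the book in the response
  query: {
    nested: {
      path: 'toc',                   // nested field holding the chapters
      query: {
        match: { 'toc.html': 'white rabbit' }
      },
      inner_hits: { size: 1 }        // return only the top matching chapter
    }
  }
};
```

This limits what comes back in the response; the indexed document itself still contains the whole book.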

nested is basically a way of stuffing everything into the same document. That can be useful for searching, but it makes certain things considerably harder.

For example, if you're trying to find a particular chapter section, your query will return the matching document, which is the whole book. I would imagine that's probably not what you're looking for, and thus a parent/child relationship would be the more appropriate way to go.

Or just don't bother, and treat book/chapter/section as separate types within an index, which you query and 'join' on demand.
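
A 'join' on demand would happen in application code after two separate searches. The sketch below assumes each section document carries the `isbn` of its book (an assumption, since the question's mapping only has `isbn` at book level) and that the hits have already been fetched; `joinSectionsToBooks` is a hypothetical helper name.

```javascript
// Attach the owning book to each section hit, matching on isbn.
// sectionHits: section documents returned by a first search.
// bookDocs: book documents returned by a second lookup on the hits' isbn values.
function joinSectionsToBooks(sectionHits, bookDocs) {
  const booksByIsbn = new Map(bookDocs.map(book => [book.isbn, book]));
  return sectionHits.map(hit => ({
    ...hit,
    book: booksByIsbn.get(hit.isbn) || null   // null when no book was found
  }));
}

const joined = joinSectionsToBooks(
  [
    { isbn: '9783791535661', title: 'Section 2' },
    { isbn: 'unknown', title: 'Orphan section' }
  ],
  [{ isbn: '9783791535661', title: 'Alice in Wonderland' }]
);
```

The trade-off is an extra round trip per search in exchange for small, independent documents.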

Sobrique