I found an easy way to search through relational data in Solr, but I am not sure whether I should optimise it further.
Let me give you an example. Say we have a system where users organize books into personal collections. A book has a genre, e.g. "Drama", "Thriller", "Horror", etc. A user's collection may contain books from different genres, and in most cases it does.
If I want to build a search where users can search through collections by genre, I'd like to rank first the collections whose books are most relevant to the genre query. What I did was a simple trick: I added a search field to the collection, named "genres", which is a concatenated string of the genres of all books in that collection. This string field is created at index time. It makes a lot of sense because, if a collection contains 30 "Thriller" and 20 "Comedy" books, it will appear as a more relevant result in a search for "Thriller" than in a search for "Comedy".
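To make the trick concrete, here is a minimal Python sketch of how such a field could be assembled before the document is sent to Solr. The function name `build_genres_field` and the sample data are made up for illustration, and the sketch assumes single-word genres (a genre like "Science Fiction" would need a different separator or tokenizer).

```python
from collections import Counter


def build_genres_field(book_genres):
    """Join the genre of every book in a collection into one string.

    Hypothetical helper: one genre token per book, separated by spaces,
    so a genre that occurs more often appears more often in the field.
    """
    return " ".join(book_genres)


# Sample collection from the example above: 30 thrillers and 20 comedies.
collection = ["Thriller"] * 30 + ["Comedy"] * 20
genres_field = build_genres_field(collection)

# Count how often each genre term occurs in the generated field.
term_counts = Counter(genres_field.split())
```

Here `term_counts` holds 30 for "Thriller" and 20 for "Comedy", which is the imbalance that makes the collection score higher for a "Thriller" query.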
As you can guess, however, the "genres" field ends up containing a lot of duplicate terms. Since it is only used behind the scenes and not displayed anywhere, this is not so much a data integrity problem as an optimization problem, IMHO.
I am fairly new to Solr. I have a rough idea of how it works, and I assume that when the inverted index is built, each term gets associated with a simple frequency count. Technically, whether the "genres" field consists of 100 terms or 10000 terms, 9500 of which are "Thriller", it should not matter much for indexing and querying speed, right?
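To spell out the mental model behind that assumption, here is a toy inverted index in Python. It is only a sketch of what I imagine, not how Lucene actually stores its index: it keeps one entry per (term, document) pair with a frequency count, and it ignores any per-occurrence data such as term positions that a real index may also keep. The document ids and the function name are made up.

```python
from collections import Counter, defaultdict


def build_inverted_index(docs):
    """Map each term to {doc_id: term frequency in that document}.

    Toy model only: one posting per (term, document), regardless of how
    many times the term is repeated inside the document.
    """
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for term, freq in Counter(text.lower().split()).items():
            index[term][doc_id] = freq
    return index


# Two hypothetical collections with the same genre mix at different sizes.
docs = {
    "collection-1": " ".join(["Thriller"] * 9500 + ["Comedy"] * 500),
    "collection-2": " ".join(["Thriller"] * 95 + ["Comedy"] * 5),
}
index = build_inverted_index(docs)
```

In this model the index has two terms with two postings each, whether a field holds 100 or 10000 tokens; only the stored counts differ. Whether the real index behaves this way is exactly what I am unsure about.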
If I am wrong, is there a syntax that lets me specify boosts in the input text itself? Say, if instead of 10000 terms, the "genres" field looked like:
"Thriller^8500 Comedy^125 Drama^12"
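For reference, producing that compact form from the duplicated string would be trivial on my side. The sketch below is plain Python; the `term^count` notation is just the format I am imagining above, not a syntax I know Solr to accept at index time, and `compact_genres` is a made-up name.

```python
from collections import Counter


def compact_genres(genres_field):
    """Collapse repeated genre terms into 'term^count' pairs.

    The output format is hypothetical (it mirrors the example in the
    question); terms are ordered from most to least frequent.
    """
    counts = Counter(genres_field.split())
    return " ".join(f"{term}^{count}" for term, count in counts.most_common())


# Duplicated field as it would be generated today.
raw = " ".join(["Thriller"] * 8500 + ["Comedy"] * 125 + ["Drama"] * 12)
compact = compact_genres(raw)
```

With these counts, `compact` is the string "Thriller^8500 Comedy^125 Drama^12".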