Should I worry about optimizing a large Solr field, with lots of duplicate terms?

Question

I found an easy way to search through relational data in Solr, but I am not sure whether I should optimise it further.

Let me give you an example. Say we have a system where users organize books in personal collections. A book has a genre, e.g. "Drama", "Thriller", "Horror", etc. A user's collection may contain books from different genres, and in most cases it does.

If I want to create a search where users can search through collections by genre, I'd like to return the collections whose books are most relevant to the genre query. What I did was a simple trick: I added a search field to the collection, named "genres", which is a concatenated string of the genres of all books in that collection. This string field is created at index time. It makes a lot of sense because, if a collection contains 30 "Thriller" and 20 "Comedy" books, it will rank as a more relevant result in a search for "Thriller" than in a search for "Comedy".
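A minimal sketch of how such a field value could be assembled before sending the document to Solr. The function name `build_genres_field` and the sample collection are hypothetical and only illustrate the idea; nothing here is Solr API.

```python
from collections import Counter

def build_genres_field(book_genres):
    """Join the genre of every book into one whitespace-separated string.

    Duplicates are kept on purpose, so the number of times a genre
    appears reflects how many books of that genre the collection holds.
    """
    return " ".join(book_genres)

# Hypothetical collection: 30 thrillers and 20 comedies.
collection = ["Thriller"] * 30 + ["Comedy"] * 20
genres_field = build_genres_field(collection)

# Sanity check: count how often each genre occurs in the field value.
term_counts = Counter(genres_field.split())
```

Note that a multi-word genre such as "Science Fiction" would be split into two terms by this naive join, so it would need a different separator or tokenizer.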

As you can guess, however, the "genres" field ends up having a lot of duplicate terms. Since it is only used behind the scenes and not displayed anywhere, this is less a data integrity problem than an optimization problem, IMHO.

I am fairly new to Solr. I have a rough idea of how it works, and I assume that when the inverted index is built, each term is simply associated with a frequency count. Technically, whether the "genres" field consists of 100 terms or 10000 terms, 9500 of which are "Thriller", it should not matter much for indexing and querying speed, right?
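The assumption above can be illustrated with a toy inverted index in Python. This is a simplified model and not how Lucene actually lays out its postings; `build_index` and the sample documents are hypothetical. It shows that a term occurring 9500 times in one document still produces a single posting entry with a count, although every occurrence still has to be tokenized at index time, and a real index that records positions or offsets stores one entry per occurrence.

```python
from collections import Counter, defaultdict

def build_index(docs):
    """Map each term to {doc_id: term_frequency}.

    One posting per (term, document) pair, no matter how often
    the term is repeated inside that document.
    """
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for term, tf in Counter(text.lower().split()).items():
            index[term][doc_id] = tf
    return index

# Two hypothetical collections with the same genre mix at different sizes.
docs = {
    "small": "thriller " * 95 + "comedy " * 5,
    "large": "thriller " * 9500 + "comedy " * 500,
}
index = build_index(docs)

# Both documents contribute exactly one posting for "thriller";
# only the stored frequency differs.
thriller_postings = index["thriller"]
```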

If I am wrong, is there a syntax for assigning boosts directly in the input text? Say, instead of 10000 terms, the "genres" field looked like:

"Thriller^8500 Comedy^125 Drama^12"

Nikolay · Answer 1 · 2013-11-02T14:45:34.907

0

You should use the payloads feature of Solr, which allows boosting individual words in a text. For an example, see http://sujitpal.blogspot.ru/2011/01/payloads-with-solr.html
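As a rough sketch of what preparing such input could look like, the Python below collapses a repeated genre list into one token per genre with a numeric weight attached. The helper `to_payload_string` is hypothetical and not part of Solr. The `|` delimiter is an assumption: it matches the default of Lucene's delimited payload token filter, but the field type has to be configured to parse it and the scoring has to be set up to use the payload.

```python
from collections import Counter

def to_payload_string(genres, delimiter="|"):
    """Collapse repeated genres into 'Genre|count' tokens.

    Tokens are ordered from most to least frequent. Genres containing
    whitespace would need escaping, which this sketch does not handle.
    """
    counts = Counter(genres)
    return " ".join(
        f"{genre}{delimiter}{count}" for genre, count in counts.most_common()
    )

# Hypothetical collection using the numbers from the question.
genres = ["Thriller"] * 8500 + ["Comedy"] * 125 + ["Drama"] * 12
payload_field = to_payload_string(genres)
```

With this input the field shrinks from 8637 tokens to 3, at the cost of a custom field type and payload-aware scoring.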

Regarding your approach: it will work fine if the stored, termPositions and termOffsets field attributes are set to false.

edited Nov 02 '13 at 14:45

answered Nov 02 '13 at 14:40

Nikolay


stored=false is supposed to keep the index size small, right? Since we do not need to display that field anyway, we do not need to store it. Is that the logic? – Preslav Rachev Nov 02 '13 at 16:45
Yes, we need to keep only a term vector. – Nikolay Nov 02 '13 at 16:55
Yes, the payloads approach seems interesting, but it is not going to have that much of an advantage once I get rid of stored=true, right? Then the term vector will only keep the term count as a reference. – Preslav Rachev Nov 03 '13 at 21:49
Yes, but passing a 10000-term string to Solr doesn't look pretty. It does work, though. – Nikolay Nov 03 '13 at 22:19
