
Why does Solr for Windows need so much memory?

My data for Solr is SEO keywords (1-10 words, up to 120 characters long, 800 million rows) and some other data. The schema is:

<?xml version="1.0" encoding="UTF-8" ?>
<schema name="suggests" version="1.5">
<copyField source="suggest" dest="suggest_exact"/>

<types>
    <fieldType name="text_stem" class="solr.TextField" positionIncrementGap="100">
        <analyzer>
            <tokenizer class="solr.StandardTokenizerFactory"/>
            <filter class="solr.LowerCaseFilterFactory"/>
            <filter class="solr.SnowballPorterFilterFactory" language="Russian" />
        </analyzer>
    </fieldType>
    <fieldType name="text_exact" class="solr.TextField" positionIncrementGap="100">
        <analyzer>
            <tokenizer class="solr.StandardTokenizerFactory"/>
            <filter class="solr.LowerCaseFilterFactory"/>
        </analyzer>
    </fieldType>
    <fieldType name="int" class="solr.TrieIntField" precisionStep="0" positionIncrementGap="0"/>
    <fieldType name="long" class="solr.TrieLongField" precisionStep="0" positionIncrementGap="0"/>
    <fieldType name="boolean" class="solr.BoolField" sortMissingLast="true"/>
</types>
<fields>
    <field name="suggest" type="text_stem" indexed="true" stored="true"/>
    <field name="suggest_exact" type="text_exact" indexed="true" stored="false"/>
    <field name="length" type="int" indexed="true" stored="true"/>
    <field name="position" type="int" indexed="true" stored="true"/>
    <field name="wordstat1" type="int" indexed="true" stored="true"/>
    <field name="wordstat3" type="int" indexed="true" stored="true"/>
    <field name="ln" type="int" indexed="true" stored="true"/>
    <field name="wc" type="int" indexed="true" stored="true"/>
 </fields>

Solr for Windows eats ~10 GB of RAM and sometimes needs more (up to 16 GB). I have now configured it with SOLR_JAVA_MEM=-Xms8192m -Xmx16384m and it works, but when the heap was 4 GB or less, Java crashed with an OutOfMemory error.
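
For reference, a minimal sketch of where that setting lives, assuming the standard Solr 5.x layout on Windows:

REM bin\solr.in.cmd -- read by bin\solr.cmd at startup
set SOLR_JAVA_MEM=-Xms8192m -Xmx16384m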

So, what am I doing wrong? How can I configure Solr to use less RAM? I can provide any part of solrconfig.xml.

solrconfig.xml

<query>
    <maxBooleanClauses>1024</maxBooleanClauses>
    <filterCache class="solr.FastLRUCache"
                 size="512"
                 initialSize="512"
                 autowarmCount="0"/>
    <queryResultCache class="solr.LRUCache"
                     size="512"
                     initialSize="512"
                     autowarmCount="0"/>
    <documentCache class="solr.LRUCache"
                   size="512"
                   initialSize="512"
                   autowarmCount="0"/>
    <cache name="perSegFilter"
      class="solr.search.LRUCache"
      size="10"
      initialSize="0"
      autowarmCount="10"
      regenerator="solr.NoOpRegenerator" />

    <enableLazyFieldLoading>true</enableLazyFieldLoading>

    <queryResultWindowSize>20</queryResultWindowSize>

    <queryResultMaxDocsCached>200</queryResultMaxDocsCached>

    <useColdSearcher>false</useColdSearcher>

    <maxWarmingSearchers>2</maxWarmingSearchers>

</query>
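
For comparison, a sketch of the same section with the caches shrunk to near-minimal sizes (the sizes are illustrative, not tuned values; they match what I later tried, see the comments below):

<query>
    <maxBooleanClauses>1024</maxBooleanClauses>
    <filterCache class="solr.FastLRUCache"
                 size="16"
                 initialSize="16"
                 autowarmCount="0"/>
    <queryResultCache class="solr.LRUCache"
                      size="16"
                      initialSize="16"
                      autowarmCount="0"/>
    <documentCache class="solr.LRUCache"
                   size="16"
                   initialSize="16"
                   autowarmCount="0"/>
</query>

On an index of this size the filterCache dominates: each cached filter is a bitset with one bit per document, so with 800 million documents every entry costs roughly 100 MB of heap.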

So, here is exactly what I do and what I want.

I added 800 million rows to Solr. And that's not all: I have datasets with 3 billion rows. The rows are SEO keywords like "job hunting", "find job in new york", etc. The "suggest" field contains many identical, commonly used words like "job", "download" and others. I think the word "download" occurs in 10% of all rows.

I am building a service where users can make a query like "download" and get all documents that contain the word "download".

I created a desktop application (.NET) to communicate between the web interface of the service (PHP+MySQL) and Solr. This application gets a task from the web service, queries Solr, downloads the results and provides them to the user.

To get all of the results, I send a GET query to Solr like this:

http://localhost:8983/solr/suggests2/select?q=suggest:(job%20AND%20new%20AND%20york)&fq=length:[1%20TO%2032]&fq=position:[1%20TO%2010]&fq=wc:[1%20TO%2032]&fq=ln:[1%20TO%20256]&fq=wordstat1:[0%20TO%20*]&fq=wordstat3:[1%20TO%20100000000]&sort=wordstat3%20desc&start=0&rows=100000&fl=suggest%2Clength%2Cposition%2Cwordstat1%2Cwordstat3&wt=csv&csv.separator=;
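
For readability, the same request with its parameters URL-decoded:

q=suggest:(job AND new AND york)
fq=length:[1 TO 32]
fq=position:[1 TO 10]
fq=wc:[1 TO 32]
fq=ln:[1 TO 256]
fq=wordstat1:[0 TO *]
fq=wordstat3:[1 TO 100000000]
sort=wordstat3 desc
start=0
rows=100000
fl=suggest,length,position,wordstat1,wordstat3
wt=csv
csv.separator=;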

As you can see, I use fq and sorting and do not use grouping. If anybody sees mistakes in my Solr query or approach, please feel free to tell me about them. Thanks.

devspec
  • Can you also provide the cache size configuration from `solrconfig.xml`? – YoungHobbit Dec 05 '15 at 14:55
  • Yes, of course. http://pastebin.com/MNhnHRBq – devspec Dec 05 '15 at 16:33
  • Actually, I don't need caching in Solr, because users' requests are totally different. – devspec Dec 05 '15 at 16:41
  • I thought it might be large sizes for the different caches that were consuming the memory. But you are using only the default cache sizes. – YoungHobbit Dec 05 '15 at 16:56
  • So, what can I do? :) – devspec Dec 05 '15 at 17:11
  • [Link-1](http://stackoverflow.com/questions/21135544/solr-on-tomcat-windows-os-consumes-all-memory) and [Link-2](http://stackoverflow.com/questions/9894687/solr-uses-too-much-memory). Please go through these links; they might help you. – YoungHobbit Dec 05 '15 at 17:33
  • Thank you, I will try. It seems like exactly my problem. Do you have any other suggestions? – devspec Dec 05 '15 at 20:03
  • With 800M rows, which I expect are turned into documents, one entry in your filterCache takes up 100MB (one bit per document). The filterCache can hold up to 512 of those, which translates to 50GB. So whether you think you use fq or not, you are best off adjusting the size of the filterCache way down. Set it to 10 (~1GB of heap) or something like that. The usual culprits for high memory usage are grouping, sorting & faceting. Do you use any of those, and if so, try to describe in detail what you do. – Toke Eskildsen Dec 06 '15 at 14:20
  • I reduced filterCache, queryResultCache and documentCache down to 16. It didn't help. Why can't I limit RAM usage to, for example, 4 GB, even if it sacrifices query speed? – devspec Dec 06 '15 at 19:20
  • Please react to the second half of my previous comment. – Toke Eskildsen Dec 07 '15 at 09:17
  • Edited the question and added some information. Please take a look. Thanks. – devspec Dec 07 '15 at 09:44

1 Answer


You are sorting on a TrieIntField that does not have DocValues turned on. That means Solr will keep a copy of the values on the heap. With 800M values at 4 bytes each, that is 3.2GB of heap just for that. Setting docValues="true" for your wordstat3 field and re-indexing should lower that requirement considerably, at the cost of some performance.
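
For illustration, a minimal sketch of the changed field definition in the schema above (re-indexing is required for the change to take effect):

<field name="wordstat3" type="int" indexed="true" stored="true" docValues="true"/>

With docValues, the sort values are read from disk via memory-mapped files instead of being materialized on the Java heap, so they are cached by the OS rather than the JVM.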

Do note that Solr (really Lucene) does not support more than 2 billion documents in a single shard. That is a hard limit. If you plan to index 3 billion documents into the same logical index, you will have to use a multi-shard SolrCloud setup.
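
As a sketch of that route, with a hypothetical shard count (this assumes Solr is running in SolrCloud mode):

bin\solr.cmd create -c suggests2 -shards 4 -replicationFactor 1

Spread over 4 shards, 3 billion documents come to roughly 750 million per shard, comfortably under the 2 billion limit.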

Toke Eskildsen