Why does Solr for Windows need so much memory?
My Solr data consists of SEO keywords (1-10 words, up to 120 characters long, 800 million rows) and some other data. The schema is:
<?xml version="1.0" encoding="UTF-8" ?>
<schema name="suggests" version="1.5">
  <copyField source="suggest" dest="suggest_exact"/>
  <types>
    <fieldType name="text_stem" class="solr.TextField" positionIncrementGap="100">
      <analyzer>
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.SnowballPorterFilterFactory" language="Russian"/>
      </analyzer>
    </fieldType>
    <fieldType name="text_exact" class="solr.TextField" positionIncrementGap="100">
      <analyzer>
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
    </fieldType>
    <fieldType name="int" class="solr.TrieIntField" precisionStep="0" positionIncrementGap="0"/>
    <fieldType name="long" class="solr.TrieLongField" precisionStep="0" positionIncrementGap="0"/>
    <fieldType name="boolean" class="solr.BoolField" sortMissingLast="true"/>
  </types>
  <fields>
    <field name="suggest" type="text_stem" indexed="true" stored="true"/>
    <field name="suggest_exact" type="text_exact" indexed="true" stored="false"/>
    <field name="length" type="int" indexed="true" stored="true"/>
    <field name="position" type="int" indexed="true" stored="true"/>
    <field name="wordstat1" type="int" indexed="true" stored="true"/>
    <field name="wordstat3" type="int" indexed="true" stored="true"/>
    <field name="ln" type="int" indexed="true" stored="true"/>
    <field name="wc" type="int" indexed="true" stored="true"/>
  </fields>
</schema>
Solr on Windows uses about 10 GB of RAM and sometimes needs more (up to 16 GB).
I currently run it with SOLR_JAVA_MEM=-Xms8192m -Xmx16384m
and it works, but with a heap of 4 GB or less Java crashed with an OutOfMemoryError.
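If it matters, this is set via the standard Windows start script - something like the following in bin\solr.in.cmd, assuming the default layout:

REM JVM heap for Solr: 8 GB initial, 16 GB max
set SOLR_JAVA_MEM=-Xms8192m -Xmx16384m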
So, what am I doing wrong? How can I configure Solr to use less RAM?
I can provide any other part of solrconfig.xml if needed.
solrconfig.xml:
<query>
<maxBooleanClauses>1024</maxBooleanClauses>
<filterCache class="solr.FastLRUCache"
size="512"
initialSize="512"
autowarmCount="0"/>
<queryResultCache class="solr.LRUCache"
size="512"
initialSize="512"
autowarmCount="0"/>
<documentCache class="solr.LRUCache"
size="512"
initialSize="512"
autowarmCount="0"/>
<cache name="perSegFilter"
class="solr.search.LRUCache"
size="10"
initialSize="0"
autowarmCount="10"
regenerator="solr.NoOpRegenerator" />
<enableLazyFieldLoading>true</enableLazyFieldLoading>
<queryResultWindowSize>20</queryResultWindowSize>
<queryResultMaxDocsCached>200</queryResultMaxDocsCached>
<useColdSearcher>false</useColdSearcher>
<maxWarmingSearchers>2</maxWarmingSearchers>
</query>
So, here is what exactly I do and what I want.
I have added 800 million rows to Solr, and that is not all - I have datasets with 3 billion rows. The rows are SEO keywords like "job hunting", "find job in new york", etc. The "suggest" field contains a lot of identical, commonly used words like "job", "download" and others; I think the word "download" appears in about 10% of all rows.
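That is roughly 0.10 × 800,000,000 ≈ 80,000,000 rows containing "download" alone.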
I am building a service where users can submit a query like "download" and get all documents that contain the word "download".
I created desktop software (.NET) that sits between the service's web interface (PHP+MySQL) and Solr. It receives a task from the web service, queries Solr, downloads the results and delivers them to the user.
To get all of the results, I send a GET request to Solr like this:
http://localhost:8983/solr/suggests2/select?q=suggest:(job%20AND%20new%20AND%20york)&fq=length:[1%20TO%2032]&fq=position:[1%20TO%2010]&fq=wc:[1%20TO%2032]&fq=ln:[1%20TO%20256]&fq=wordstat1:[0%20TO%20*]&fq=wordstat3:[1%20TO%20100000000]&sort=wordstat3%20desc&start=0&rows=100000&fl=suggest%2Clength%2Cposition%2Cwordstat1%2Cwordstat3&wt=csv&csv.separator=;
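For readability, here is the same query with its parameters URL-decoded:

q = suggest:(job AND new AND york)
fq = length:[1 TO 32]
fq = position:[1 TO 10]
fq = wc:[1 TO 32]
fq = ln:[1 TO 256]
fq = wordstat1:[0 TO *]
fq = wordstat3:[1 TO 100000000]
sort = wordstat3 desc
start = 0
rows = 100000
fl = suggest,length,position,wordstat1,wordstat3
wt = csv
csv.separator = ;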
As you can see, I use fq and sorting and do not use grouping. If anybody sees mistakes in my Solr query or overall approach, please feel free to point them out. Thanks.