Description (for reference):
I want to index an entire drive of files: ~2TB.
I'm getting the list of files (using the Commons IO library).
Once I have the list of files, I go through each file and extract readable data from it using Apache Tika.
Once I have the data, I index it using Solr.
I'm using SolrJ with the Java application.
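For reference, here is a minimal sketch of the pipeline; the Solr URL, core name, and drive path are placeholders, and the SolrJ 6+ HttpSolrClient builder is an assumption, not necessarily my exact setup:

```java
import java.io.File;
import java.io.FileInputStream;
import java.io.InputStream;
import java.util.Collection;

import org.apache.commons.io.FileUtils;
import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrInputDocument;
import org.apache.tika.Tika;

public class DriveIndexer {
    public static void main(String[] args) throws Exception {
        // Placeholder URL and core name; assumes SolrJ 6+.
        SolrClient solr = new HttpSolrClient.Builder("http://localhost:8983/solr/files").build();
        Tika tika = new Tika();

        // Commons IO: null extensions = all files, true = recurse.
        // (FileUtils.iterateFiles() streams instead of holding the whole list in memory.)
        Collection<File> files = FileUtils.listFiles(new File("Z:/"), null, true);

        for (File file : files) {
            try (InputStream in = new FileInputStream(file)) {
                // Tika facade: detects the type and returns readable text
                // (capped at 100k characters by default).
                String text = tika.parseToString(in);

                SolrInputDocument doc = new SolrInputDocument();
                doc.addField("id", file.getAbsolutePath());
                doc.addField("content", text);
                solr.add(doc);   // one doc per request here; batching sketched below
            } catch (Exception e) {
                System.err.println("Skipping " + file + ": " + e.getMessage());
            }
        }
        solr.commit();
        solr.close();
    }
}
```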
My question is: how do I decide what size of collection to pass to Solr? I've tried passing in different sizes with different results, i.e. sometimes 150 documents per collection performs better than 100 documents, but sometimes it doesn't. Is there an optimal size, or a configuration that can be tweaked, given that this process has to be carried out repeatedly?
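To make the batch size a single tunable knob, I buffer documents and send each batch with one call. This is a sketch only; BatchIndexer and batchSize are hypothetical names, though SolrClient.add(Collection) is real SolrJ API:

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.SolrServerException;
import org.apache.solr.common.SolrInputDocument;

public class BatchIndexer {
    private final SolrClient solr;
    private final int batchSize;   // the knob in question, e.g. 100 or 150
    private final List<SolrInputDocument> buffer = new ArrayList<>();

    public BatchIndexer(SolrClient solr, int batchSize) {
        this.solr = solr;
        this.batchSize = batchSize;
    }

    public void add(SolrInputDocument doc) throws SolrServerException, IOException {
        buffer.add(doc);
        if (buffer.size() >= batchSize) {
            flush();
        }
    }

    public void flush() throws SolrServerException, IOException {
        if (buffer.isEmpty()) {
            return;
        }
        solr.add(buffer);   // one HTTP round trip for the whole batch
        buffer.clear();     // release extracted text so the 512MB heap isn't pinned
    }
}
```

After the loop, one final flush() plus commit() picks up the last partial batch. The trade-off seems to be round trips versus heap: each buffered document holds its extracted text, so under a 512MB cap larger batches of big files may cost more in GC than they save in HTTP overhead, which may be why no single size consistently wins.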
Complications:
1) Files are stored on a network drive; retrieving the filenames/files takes some time too.
2) Neither this program (Java app) nor Solr itself can use more than 512MB of RAM (a sketch of capping extraction memory follows this list).
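For complication 2, one possible way to bound per-file memory is Tika's BodyContentHandler write limit; the 1M-character cap and class name below are illustrative assumptions:

```java
import java.io.File;
import java.io.FileInputStream;
import java.io.InputStream;

import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.sax.BodyContentHandler;
import org.xml.sax.SAXException;

public class BoundedExtract {
    // Hypothetical cap: at most 1M chars of extracted text per file,
    // so one huge document can't exhaust the 512MB heap.
    private static final int MAX_CHARS = 1_000_000;

    public static String extract(File file) throws Exception {
        AutoDetectParser parser = new AutoDetectParser();
        BodyContentHandler handler = new BodyContentHandler(MAX_CHARS);
        try (InputStream in = new FileInputStream(file)) {
            parser.parse(in, handler, new Metadata());
        } catch (SAXException e) {
            // Hitting the write limit aborts parsing with a SAXException;
            // the text gathered so far is still in the handler.
            // (A real version would distinguish the limit from a parse failure.)
        }
        return handler.toString();
    }
}
```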