
I've been trying to index a large number of documents in Solr (~200 million docs). I use pysolr to do the indexing. However, the Solr server keeps going down while indexing (sometimes after ~100 million documents have been indexed, sometimes after ~180 million; it varies). I'm not sure why this happens. Is it because of the open file limit, i.e., related to the warning I get when starting the server with bin/solr start?

* [WARN] * Your open file limit is currently 1024. It should be set to 65000 to avoid operational disruption.
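
If that warning is the culprit, one cheap experiment is to raise the limit before starting Solr. A minimal sketch for a typical Linux box (the user name `solr` in the permanent variant is an assumption; substitute whichever account runs the Solr process):

```shell
# Check the current soft limit on open file descriptors
ulimit -Sn

# Raise it for this shell session before running bin/solr start
# (only works if the hard limit permits it; otherwise root must raise it)
ulimit -n 65000

# To make it permanent, add lines like these to /etc/security/limits.conf,
# assuming the Solr process runs as user "solr":
#   solr  soft  nofile  65000
#   solr  hard  nofile  65000
```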

I used multiprocessing while indexing, with chunks of 25000 records (but I also tried with bigger chunks and without multiprocessing, and it still crashed). Is it because too many requests are being sent to Solr? My Python code is below.

import csv
import os
import concurrent.futures
from glob import glob

import pysolr

solr = pysolr.Solr('http://localhost:8983/solr/collection_name', always_commit=True)

def insert_into_solr(filepath):
    """ Inserts records into an empty solr index which has already been created."""
    record_number = 0
    list_for_solr = []
    with open(filepath, "r") as file:
        csv_reader = csv.reader((line.replace('\0', '') for line in file), delimiter='\t', quoting=csv.QUOTE_NONE)
        for paper_id, paper_reference_id, context in csv_reader:
            # int, int, string
            record_number += 1
            solr_record = {}
            solr_record['paper_id'] = paper_id
            solr_record['reference_id'] = paper_reference_id
            solr_record['context'] = context
            # Send to Solr in chunks of 25000 records
            list_for_solr.append(solr_record)
            if record_number % 25000 == 0:
                try:
                    solr.add(list_for_solr)
                except Exception as e:
                    print(e, record_number, filepath)
                list_for_solr = []
                print(record_number)
        # Flush whatever is left after the last full chunk
        try:
            solr.add(list_for_solr)
        except Exception as e:
            print(e, record_number, filepath)

def create_concurrent_futures():
    """ Uses all the cores to do the parsing and inserting"""
    folderpath = '.../'
    refs_files = glob(os.path.join(folderpath, '*.txt'))
    with concurrent.futures.ProcessPoolExecutor() as executor:
        executor.map(insert_into_solr, refs_files, chunksize=1)

if __name__ == '__main__':
    create_concurrent_futures()
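
As an aside on the batching itself: the modulo-based flushing above can be factored into a small helper, and, assuming pysolr's `add` accepts `commit` and `commitWithin` keyword arguments (recent pysolr versions do), the hard commit that `always_commit=True` forces after every 25000-record request can be avoided, which reduces load on Solr considerably during a bulk index. A sketch (the `records` iterable and the commented-out Solr calls are illustrative, not tested against a live server):

```python
import itertools

def chunked(iterable, size):
    """Yield successive lists of at most `size` items from `iterable`."""
    it = iter(iterable)
    while True:
        batch = list(itertools.islice(it, size))
        if not batch:
            return
        yield batch

# Hypothetical usage against the same collection. commit=False skips the
# hard commit after each batch, and commitWithin=60000 asks Solr itself to
# commit within 60 seconds instead.
# solr = pysolr.Solr('http://localhost:8983/solr/collection_name')
# for batch in chunked(records, 25000):
#     solr.add(batch, commit=False, commitWithin=60000)
```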

I read somewhere that the standard Solr installation has a hard limit of around 2.14 billion documents. Is it better to use SolrCloud (which I have never configured) when there are hundreds of millions of docs? Will it help with this problem? (I also have another file with 1.4 billion documents which needs to be indexed after this.) I have only one server; is there any point in trying to configure SolrCloud?

  • An easy test is to change the ulimit and see if it helps - see [File handles and processes - ulimit settings](https://lucene.apache.org/solr/guide/7_3/taking-solr-to-production.html#file-handles-and-processes-ulimit-settings) for information. Using SolrCloud is helpful when you want to spread the set of documents across multiple servers. The 2.1b limit is per shard / core, so using a collection in SolrCloud with multiple servers (even if they're running on a single machine but with different working directories) will allow you to scale that up further. – MatsLindh Nov 15 '18 at 20:20
  • Thanks @MatsLindh. I wanted to find out before asking the sysadmin to increase the ulimit. I set up Solr using the method in 'Getting started' (i.e., I just extracted it) rather than following the process in the 'Taking Solr to production' page. I suppose using default configs might also be contributing to this issue? So from what I understand, it's probably best to configure and use SolrCloud when there are so many documents, isn't it? – ash Nov 15 '18 at 21:10
  • 1
    That depends. It can be - if the amount of queries or total number of documents requires it. It's also easier to scale in the future if necessary, but for prototyping and hosting something for data exploration with a small number of users, it's probably not. – MatsLindh Nov 16 '18 at 09:14
  • Thanks @MatsLindh for the advice. That's very helpful. – ash Nov 17 '18 at 02:40
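
A minimal way to try the per-shard scaling described in the comments on a single machine, sketched here assuming a Solr 7.x tarball layout (the shard count and collection name are illustrative):

```shell
# Start Solr in SolrCloud mode with its embedded ZooKeeper
bin/solr start -c

# Create a collection split into 4 shards on this one node; each shard is
# its own core, so each carries its own ~2.1 billion document ceiling
bin/solr create -c collection_name -shards 4 -replicationFactor 1
```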

0 Answers