
I am using Lucene.Net to index my PDF files. It takes about 40 minutes to index 15,000 PDFs, and the indexing time increases as the number of PDF files in my folder grows.

  • How can I improve indexing speed in Lucene.Net?
  • Is there any other indexing service with faster indexing performance?

I am using the latest version of Lucene.Net (3.0.3).

Here is my indexing code:

public void refreshIndexes()
{
    // Create the index writer
    string strIndexDir = @"E:\LuceneTest\index";
    IndexWriter writer = new IndexWriter(
        Lucene.Net.Store.FSDirectory.Open(new System.IO.DirectoryInfo(strIndexDir)),
        new StandardAnalyzer(Lucene.Net.Util.Version.LUCENE_29),
        true,
        IndexWriter.MaxFieldLength.UNLIMITED);

    // Find all files in the root folder and index them
    List<string> lstFiles = searchFiles(@"E:\LuceneTest\PDFs");
    foreach (string strFile in lstFiles)
    {
        Document doc = new Document();
        string FileName = System.IO.Path.GetFileNameWithoutExtension(strFile);
        string Text = ExtractTextFromPdf(strFile);
        string Path = strFile;
        string ModifiedDate = Convert.ToString(File.GetLastWriteTime(strFile));
        string DocumentType = string.Empty;
        string Vault = string.Empty;

        // Classify the document by the first 150 characters of its text
        string headerText = Text.Substring(0, Text.Length < 150 ? Text.Length : 150);
        foreach (var docs in ltDocumentTypes)
        {
            if (headerText.ToUpper().Contains(docs.searchText.ToUpper()))
            {
                DocumentType = docs.DocumentType;
                Vault = docs.VaultName;
            }
        }

        if (string.IsNullOrEmpty(DocumentType))
        {
            DocumentType = "Default";
            Vault = "Default";
        }

        doc.Add(new Field("filename", FileName, Field.Store.YES, Field.Index.ANALYZED));
        doc.Add(new Field("text", Text, Field.Store.YES, Field.Index.ANALYZED));
        doc.Add(new Field("path", Path, Field.Store.YES, Field.Index.NOT_ANALYZED));
        doc.Add(new Field("modifieddate", ModifiedDate, Field.Store.YES, Field.Index.ANALYZED));
        doc.Add(new Field("documenttype", DocumentType, Field.Store.YES, Field.Index.ANALYZED));
        doc.Add(new Field("vault", Vault, Field.Store.YES, Field.Index.ANALYZED));

        writer.AddDocument(doc);
    }
    writer.Optimize();
    writer.Dispose();
}
– Munavvar
  • Do you really need to call `writer.Optimize()`? Wouldn't a `writer.Commit()` be enough? – sisve Aug 01 '16 at 07:27
  • Thanks for the reply @SimonSvensson. Optimize() is not necessary; I tried Commit(), but there was no improvement in performance. – Munavvar Aug 01 '16 at 09:53
  • @Munavvar, before proposing any changes, did you try adding some benchmarks for the relevant methods? I would be particularly interested in the searchFiles and ExtractTextFromPdf methods. I believe the issue may be in the latter, as your code looks OK (apart from the dates, which shouldn't be analyzed). Moreover, what's the size of your PDFs? You can restrict indexing and analysis to a relevant number of characters. – AR1 Aug 01 '16 at 17:47

1 Answer


The indexing part looks OK. Note that IndexWriter is thread-safe, so using Parallel.ForEach (with MaxDegreeOfParallelism set to the number of cores; play with this value) will probably help if you're on a multicore machine.
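
For illustration, a minimal sketch of that parallel loop, assuming a hypothetical BuildDocument helper that wraps the per-file code from the question:

    using System.Threading.Tasks;

    // Sketch only: each Document is built on a worker thread;
    // IndexWriter.AddDocument is safe to call concurrently in Lucene.Net 3.0.3.
    var files = searchFiles(@"E:\LuceneTest\PDFs");
    var options = new ParallelOptions { MaxDegreeOfParallelism = Environment.ProcessorCount };

    Parallel.ForEach(files, options, strFile =>
    {
        Document doc = BuildDocument(strFile); // hypothetical helper: the per-file code from the question
        writer.AddDocument(doc);               // thread-safe
    });

    writer.Commit();
    writer.Dispose();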

But you're making your GC crazy with the document-type detection part. All those ToUpper() calls are painful (each one allocates a new string).

  • Outside of the lstFiles loop, build an uppercase copy of each ltDocumentTypes entry's searchText, keeping the DocumentType and VaultName alongside it so a match can still set them:

    var upperDocTypes = ltDocumentTypes
        .Select(x => new { SearchText = x.searchText.ToUpper(), x.DocumentType, x.VaultName })
        .ToList();
    
  • Outside of the doc-types loop, create the uppercase header once:

    string headerTextUpper = headerText.ToUpper();
    
  • When it finds a match, break. This exits the loop as soon as a match is found and skips all the remaining iterations. Of course this means first match wins, whereas yours is last match wins (if that makes a difference to you):

    string headerText = Text.Substring(0, Text.Length < 150 ? Text.Length : 150);
    string headerTextUpper = headerText.ToUpper();
    foreach (var docType in upperDocTypes)
    {
        if (headerTextUpper.Contains(docType.SearchText))
        {
            DocumentType = docType.DocumentType;
            Vault = docType.VaultName;
            break;
        }
    }
    

Depending on the size of ltDocumentTypes, this may not give you much improvement.

I would bet that the most expensive part is ExtractTextFromPdf. Running this through a profiler, or instrumenting it with some Stopwatches, will show you where the cost is.
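
As a rough illustration, you could instrument it like this (the extractTimer accumulator is my addition, not part of the original code):

    using System.Diagnostics;

    var extractTimer = new Stopwatch();
    foreach (string strFile in lstFiles)
    {
        extractTimer.Start();
        string Text = ExtractTextFromPdf(strFile); // suspected hot spot
        extractTimer.Stop();

        // ... build and add the document exactly as before ...
    }
    Console.WriteLine("Total time in ExtractTextFromPdf: " + extractTimer.Elapsed);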

– AndyPook