
Given an index created with Lucene 8, but without knowledge of the fields used, how can I programmatically extract all the fields? (I'm aware that the Luke browser can be used interactively, thanks to @andrewjames: see "Examples for using latest version of Lucene".) The scenario is that, during a development phase, I have to read indexes without prescribed schemas. I'm using:

IndexReader reader = DirectoryReader.open(FSDirectory.open(Paths.get(index)));
IndexSearcher searcher = new IndexSearcher(reader);

The reader has methods such as:

reader.getDocCount(field);

but this requires knowing the fields in advance.

I understand that documents in the index may be indexed with different fields; I'm quite prepared to iterate over all documents and extract the fields on a regular basis (these indexes are not huge).

I'm using Lucene 8.5.*, so posts and tutorials based on earlier Lucene versions may not work.

peter.murray.rust

1 Answer


You can access basic field info as follows:

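import java.io.IOException;
import java.nio.file.Paths;
import java.util.List;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexableField;
import org.apache.lucene.index.MultiBits;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Bits;

public class IndexDataExplorer {

    private static final String INDEX_PATH = "/path/to/index/directory";

    public static void doSearch() throws IOException {
        // try-with-resources closes the reader when the loop is done
        try (IndexReader reader = DirectoryReader.open(FSDirectory.open(Paths.get(INDEX_PATH)))) {
            // liveDocs is null when the index contains no deletions
            Bits liveDocs = MultiBits.getLiveDocs(reader);
            // doc IDs run from 0 to maxDoc() - 1; numDocs() only counts live documents
            for (int i = 0; i < reader.maxDoc(); i++) {
                if (liveDocs != null && !liveDocs.get(i)) {
                    continue; // skip deleted documents
                }
                Document doc = reader.document(i);
                // a document loaded this way contains its stored fields only
                List<IndexableField> fields = doc.getFields();
                for (IndexableField field : fields) {
                    // use these to get field-related data:
                    //field.name();
                    //field.fieldType().toString();
                }
            }
        }
    }
}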
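
A document loaded with `reader.document(i)` carries only its stored fields, so fields that were indexed but not stored do not show up when iterating over documents. As a complement, here is a minimal sketch, assuming Lucene 8.x, that reads the index-wide field metadata through `FieldInfos.getMergedFieldInfos` instead of visiting each document. The class name `FieldLister` and the method `listFields` are hypothetical names chosen for illustration, and the description string is just one possible summary of each field.

```java
import java.io.IOException;
import java.nio.file.Paths;
import java.util.Map;
import java.util.TreeMap;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.FieldInfo;
import org.apache.lucene.index.FieldInfos;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.store.FSDirectory;

// Hypothetical helper class for illustration; not part of the answer's code.
public class FieldLister {

    // Returns a map of field name -> short description, sorted by field name.
    public static Map<String, String> listFields(IndexReader reader) {
        Map<String, String> result = new TreeMap<>();
        // Merge the per-segment field metadata into one view of the whole index.
        FieldInfos infos = FieldInfos.getMergedFieldInfos(reader);
        for (FieldInfo info : infos) {
            result.put(info.name,
                "indexOptions=" + info.getIndexOptions()
                + ", docValues=" + info.getDocValuesType());
        }
        return result;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: FieldLister <index-directory>");
            return;
        }
        try (IndexReader reader = DirectoryReader.open(FSDirectory.open(Paths.get(args[0])))) {
            listFields(reader).forEach((name, desc) -> System.out.println(name + ": " + desc));
        }
    }
}
```

Because this metadata is kept per segment, it describes the schema of the index as a whole rather than of any single document; to see which stored fields a particular document has, the per-document loop is still needed.
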
andrewJames
  • This works, thanks. (We're using this to index the COVID-19 literature in a volunteer project: http://github.com/petermr/openVirus.) – peter.murray.rust Jun 03 '20 at 10:06
  • You should also check that the document is not deleted (see reader.document javadocs): "for performance reasons, this method does not check if the requested document is deleted, and therefore asking for a deleted document may yield unspecified results". – Adrian Dec 06 '21 at 17:34