IBM Watson - Retrieve and Rank: How to tell that a text in a PDF document should be considered a field?

Question

I am loading lots of PDF documents into a Retrieve and Rank service, but I do not know how to tell Solr or the IBM Retrieve and Rank service that a specific part of a PDF document should be treated as a field for later queries, for example a name or a document process id.

Welcome to Stack Overflow! Please review our [SO Question Checklist](http://meta.stackoverflow.com/questions/260648/stack-overflow-question-checklist) to help you to ask a good question, and thus get a good answer. — Joe C, Oct 20 '16 at 20:36

score 0 · Accepted Answer · answered Oct 21 '16 at 10:43

You can't do this when uploading documents using the web-based UI, as this only populates some default fields like body and title.

But you can programmatically add the contents of your PDF documents to the R&R collection. And when you do this, you're free to add any fields you want.

For example, adapted from the documentation at https://www.ibm.com/watson/developercloud/retrieve-and-rank/api/v1/?java#index_doc:

RetrieveAndRank service = new RetrieveAndRank();
service.setUsernameAndPassword("{username}","{password}");

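// Build the document and add whatever fields you need.
SolrInputDocument document = new SolrInputDocument();
document.addField("id", 1);
document.addField("author", "brenckman,m.");
document.addField("bibliography", "j. ae. scs. 25, 1958, 324.");
// ... add any further fields here

// solrClient is assumed to be a SolrJ client already set up for your R&R Solr cluster (setup not shown).
UpdateResponse addResponse = solrClient.add("example_collection", document);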

solrClient.commit("example_collection");

In the same way that this example is using author and bibliography as additional field names, you can add new ones such as a process id.
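As a rough sketch of how a value such as a process id could be pulled out of a PDF's text before indexing, the helper below uses a regular expression on text that has already been extracted from the PDF (the extraction step itself is not shown). The class name `PdfFieldExtractor`, the `process_id` field name, and the "Process ID: ..." label format are all assumptions made for illustration; none of them come from the R&R SDK or SolrJ, so adjust them to match your documents and schema.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: not part of the Retrieve and Rank SDK or SolrJ.
class PdfFieldExtractor {

    // Assumed label format, e.g. "Process ID: ABC-12345". Adjust to your documents.
    private static final Pattern PROCESS_ID =
            Pattern.compile("Process ID:\\s*([A-Za-z0-9-]+)", Pattern.CASE_INSENSITIVE);

    // Returns the first process id found in the text, if any.
    static Optional<String> extractProcessId(String text) {
        Matcher m = PROCESS_ID.matcher(text);
        return m.find() ? Optional.of(m.group(1)) : Optional.empty();
    }

    // Collects field name/value pairs for one document.
    // "process_id" is only added when the pattern matches.
    static Map<String, Object> toFields(String id, String text) {
        Map<String, Object> fields = new LinkedHashMap<>();
        fields.put("id", id);
        fields.put("body", text);
        extractProcessId(text).ifPresent(pid -> fields.put("process_id", pid));
        return fields;
    }

    public static void main(String[] args) {
        // Stand-in for text already extracted from a PDF.
        String text = "Contract summary\nProcess ID: ABC-12345\nParties: ...";
        System.out.println(toFields("1", text));
    }
}
```

Each entry of the resulting map could then be passed to `document.addField(name, value)` on the `SolrInputDocument` from the example above.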

You'll need to update the schema for your R&R collection to declare these new fields. The schema at https://github.com/IBM-Watson/kale/blob/master/solr/knowledge-expansion-en.xml#L36 is an example of how to specify additional fields.
