I'm implementing a solution using Watson's Retrieve & Rank service.

When I use the tooling interface, I upload my documents and they appear as a list. Clicking any of them shows all the titles (answer units) inside that document, as you can see in Picture 1 and Picture 2.

When I try to upload documents via Java, they aren't recognized as whole documents: they get uploaded in parts, with each answer unit indexed as a new document.

How can I upload each document as an entire document, and not only as separate parts of it?

Here's the code for the upload function in Java:

    // Converts a file into answer units and indexes each unit as its own Solr document.
    public Answers ConvertToUnits(File doc, String collection) throws ParseException, SolrServerException, IOException {
        DC.setUsernameAndPassword(USERNAME, PASSWORD);
        Answers response = DC.convertDocumentToAnswer(doc).execute();
        WatsonProcessing wp = new WatsonProcessing();

        for (int i = 0; i < response.getAnswerUnits().size(); i++) {
            // A fresh Solr document per answer unit, so no fields leak between units.
            SolrInputDocument newdoc = new SolrInputDocument();
            String titulo = response.getAnswerUnits().get(i).getTitle();
            // The unit ID is read here but is not currently written to the index.
            String id = response.getAnswerUnits().get(i).getId();
            newdoc.addField("title", titulo);

            // Each content block of the unit becomes one more value of "body".
            for (int j = 0; j < response.getAnswerUnits().get(i).getContent().size(); j++) {
                String texto = response.getAnswerUnits().get(i).getContent().get(j).getText();
                newdoc.addField("body", texto);
            }
            wp.IndexDocument(newdoc, collection);
        }
        wp.ComitChanges(collection);
        return response;
    }


    // Adds a single document to the given collection (the commit happens separately).
    public void IndexDocument(SolrInputDocument newdoc, String collection) throws SolrServerException, IOException {
        UpdateResponse addResponse = solrClient.add(collection, newdoc);
    }

Archerspk

1 Answer

You can specify config options in this line:

    Answers response = DC.convertDocumentToAnswer(doc).execute();

I think something like this should do the trick:

    String configAsString = "{ \"conversion_target\":\"answer_units\", \"answer_units\": { \"selector_tags\": [] } }";

    JsonParser jsonParser = new JsonParser();
    JsonObject customConfig = jsonParser.parse(configAsString).getAsJsonObject();

    Answers response = DC.convertDocumentToAnswer(doc, null, customConfig).execute();

I've not tried it out, so I might not have got the syntax exactly right, but hopefully this will get you on the right track.

Essentially, what I'm trying to do here is use the selector_tags option in the config (see https://www.ibm.com/watson/developercloud/doc/document-conversion/customizing.shtml#htmlau for the docs on this) to specify which tags the document should be split on. Specifying an empty list, with no tags in it, means the document isn't split at all and comes out as a single answer unit, as you want.
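If you end up wanting to switch between "no splitting" and "split on certain headings", here is a minimal sketch of a helper that builds the config string from a list of tags. The class and method names (`SelectorTagsConfig`, `buildConfig`) are made up for illustration, and like the snippet above this is untested against the service, so treat the JSON shape as an assumption taken from that snippet.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.stream.Collectors;

public class SelectorTagsConfig {

    // Hypothetical helper: builds the answer_units config JSON for the given tags.
    // An empty list yields "selector_tags": [], i.e. no splitting at all.
    public static String buildConfig(List<String> selectorTags) {
        String tags = selectorTags.stream()
                .map(tag -> "\"" + tag + "\"")
                .collect(Collectors.joining(","));
        return "{\"conversion_target\":\"answer_units\","
                + "\"answer_units\":{\"selector_tags\":[" + tags + "]}}";
    }

    public static void main(String[] args) {
        // Whole document as a single answer unit.
        System.out.println(buildConfig(Collections.<String>emptyList()));
        // Split only on top-level and second-level headings.
        System.out.println(buildConfig(Arrays.asList("h1", "h2")));
    }
}
```

The resulting string can be parsed with JsonParser and passed as customConfig in the same way as configAsString above.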

(Note that you can do this through the tooling interface, too - by unticking the "Split my documents up into individual answers for me" option when you upload the document)

dalelane
  • Hi, thanks for the answer and sorry for the delay. The problem is that I also need it to be broken into "Titles" and "Bodies", just like the Retrieve & Rank tooling does, so that I can use this to search for the info I need. I've noticed in some other examples that people use a tag like "Source" or "Topic" to tie multiple Titles/Bodies to a specific document, but Retrieve & Rank doesn't seem to understand such tags. Do you know if there are any tags that Retrieve & Rank understands that I can use to specify the source document an answer unit came from? – Archerspk Sep 06 '16 at 19:16
  • You can add any other fields to your collection's schema, and then include them when you index the document. You've got newdoc.addField("title", titulo); in your example. If you add additional fields to the schema, you can add many more addField lines. And you're right - one possible use for this could be to store something about the document that an answer unit came from. The tooling does do this, in order to be able to display all the answer units that came from a document together. – dalelane Sep 06 '16 at 20:44
  • That's great. I'm using the tooling config/schema as my default schema, but I can't find any tag in the schema to specify the document source. Do you know which tag it is? – Archerspk Sep 08 '16 at 14:24
  • The field `sourceDocId` is used by the tooling to hold the ID of the document that an individual passage came from, however these IDs are generated and managed by the tooling itself. So I don't think there will be a straightforward way for you to do this from Solr alone. More generally, I think that reusing fields that are used internally by the tooling is probably A Bad Idea, and certainly unsupported. – dalelane Sep 08 '16 at 15:43
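
To make the suggestion in the comments concrete (your own schema field rather than the tooling's `sourceDocId`), here is a rough sketch of tagging every answer unit with the document it came from. Plain maps stand in for SolrInputDocument so the example runs without SolrJ. The field name `source_document`, the `Unit` class, and the `name#index` ID scheme are all assumptions for illustration, and the field would need to be declared in the collection's schema before indexing.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class SourceTagging {

    // Hypothetical custom field; it is not part of the tooling's schema.
    public static final String SOURCE_FIELD = "source_document";

    /** Simplified stand-in for one converted answer unit. */
    public static final class Unit {
        final String title;
        final String body;

        public Unit(String title, String body) {
            this.title = title;
            this.body = body;
        }
    }

    // Builds one field map per answer unit, each carrying the source document name
    // so that units from the same file can be grouped again at query time.
    public static List<Map<String, Object>> toIndexableDocs(String sourceName, List<Unit> units) {
        List<Map<String, Object>> docs = new ArrayList<>();
        for (int i = 0; i < units.size(); i++) {
            Map<String, Object> fields = new LinkedHashMap<>();
            fields.put("id", sourceName + "#" + i); // assumed ID scheme: file name plus position
            fields.put("title", units.get(i).title);
            fields.put("body", units.get(i).body);
            fields.put(SOURCE_FIELD, sourceName);
            docs.add(fields);
        }
        return docs;
    }

    public static void main(String[] args) {
        List<Unit> units = Arrays.asList(new Unit("Intro", "First part"), new Unit("Setup", "Second part"));
        for (Map<String, Object> fields : toIndexableDocs("manual.pdf", units)) {
            System.out.println(fields);
        }
    }
}
```

In the real upload code, each entry of a map would become one newdoc.addField(name, value) call, and a Solr filter query on the custom field could then pull back all the units that belong to one document.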