Watson Retrieve and Rank / Discovery Service always returns the table of contents with the high(est) score

Question

Background:

I'm using Watson Retrieve and Rank or the Discovery Service to retrieve information from user manuals. I performed the training with an example washing machine manual in PDF format. My goal is to receive the best passages from the document where a specific natural language string occurs (like "Positioning the drain hose"). In general, this works.

My problem is that the table of contents is almost always the passage with the highest score. As a result, the first results are just the table of contents instead of the relevant text passage (see the example results).

"wrong" result (table of contents):

Unpacking the washing machine ----------------------------------------------------2 Overview of the washing machine --------------------------------------------------2 Selecting a location -------------------------------------------------------------------- 3 Adjusting the leveling feet ------------------------------------------------------------3 Removing the shipping bolts --------------------------------------------------------3 Connecting the water supply hose ------------------------------------------------- 3 Positioning the drain hose ----------------------------------------------------------- 4 Plugging in the machine

"correct" result:

Positioning the drain hose The end of the drain hose may be positioned in three ways: Over the edge of a sink The drain hose must be placed at a height of between 60 and 90 cm. To keep the drain hose spout bent, use the supplied plastic hose

Possible solutions:

Ignoring the table of contents during the training process
Using the offset parameter to skip, e.g., the first 3 results
Detecting whether a result is part of the table of contents and ignoring it if so

These approaches are static and not applicable to multiple documents with varying structures (table of contents at the beginning, at the end, no table of contents, ...).

Does anyone have an idea for a better approach to this problem?

score 0 · Answer 1 · answered Aug 16 '17 at 14:27

At this time, passage retrieval results are not affected by relevancy training. As passage retrieval always searches the entire corpus, unfortunately the only reliable way of excluding passage retrieval results from a table of contents is to remove the table of contents.
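
If removing the table of contents by hand is not practical, one option is to detect it heuristically, either before ingestion or when post-filtering results. The following is only a sketch of that idea, not a feature of the service: it assumes table-of-contents entries look like the "wrong" result in the question (a title, a run of leader characters, then a page number), and that each returned passage is a dict whose text sits under a `passage_text` key. The function names, the key default, and the thresholds are illustrative assumptions.

```python
import re

# Assumption: a table-of-contents entry contains a run of leader characters
# (dashes, dots or underscores) followed by a page number, e.g. "Title ------ 3".
LEADER_ENTRY = re.compile(r"[-._]{4,}\s*\d+")


def looks_like_toc(text, min_entries=3):
    """Return True if text contains at least min_entries leader-plus-page-number runs."""
    return len(LEADER_ENTRY.findall(text)) >= min_entries


def drop_toc_passages(passages, text_key="passage_text"):
    """Keep only the passages whose text does not look like a table of contents.

    passages is assumed to be a list of dicts; text_key names the field
    holding the passage text (hypothetical default, adjust to your response).
    """
    return [p for p in passages if not looks_like_toc(p.get(text_key, ""))]


if __name__ == "__main__":
    results = [
        {"passage_text": "Unpacking the washing machine --------2 "
                         "Selecting a location -------- 3 "
                         "Positioning the drain hose -------- 4"},
        {"passage_text": "Positioning the drain hose The end of the drain "
                         "hose may be positioned in three ways: ..."},
    ]
    for passage in drop_toc_passages(results):
        print(passage["passage_text"])
```

Because the check looks at the text itself rather than at its position, it does not depend on where the table of contents sits in a document. It will miss tables of contents that use no leader characters, and `min_entries` trades false positives against false negatives, so both would need tuning against real manuals.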
