Background:
I'm using Watson Retrieve and Rank or the Discovery Service to retrieve information from user manuals. I trained it on an example washing machine manual in PDF format. My goal is to get the best passages from the document for a specific natural-language query (like "Positioning the drain hose"). This works in general.
My problem is that the table of contents is almost always the passage with the highest score, so the first results are just the table of contents instead of the relevant text passage (see the example results).
"Wrong" result (table of contents):
Unpacking the washing machine ----------------------------------------------------2 Overview of the washing machine --------------------------------------------------2 Selecting a location -------------------------------------------------------------------- 3 Adjusting the leveling feet ------------------------------------------------------------3 Removing the shipping bolts --------------------------------------------------------3 Connecting the water supply hose ------------------------------------------------- 3 Positioning the drain hose ----------------------------------------------------------- 4 Plugging in the machine
"Correct" result:
Positioning the drain hose The end of the drain hose may be positioned in three ways: Over the edge of a sink The drain hose must be placed at a height of between 60 and 90 cm. To keep the drain hose spout bent, use the supplied plastic hose
Possible solutions:
- ignore the table of contents during the training process
- use an offset parameter, e.g. to skip the first 3 results
- check whether a result is part of the table of contents and ignore it if so
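To illustrate the third option, here is a minimal Python sketch of such a check, applied to the passages after they come back from the service. It assumes that a table of contents shows up as runs of dashes or dots followed by a page number, as in the "wrong" result above. The function names and the threshold of three matches are my own placeholders, not part of any Watson API.

```python
import re

# A run of at least four leader characters (dashes or dots) followed by
# a page number, e.g. "--------2" or "........ 14".
LEADER_RE = re.compile(r"[-.]{4,}\s*\d+")


def looks_like_toc(passage: str, min_leaders: int = 3) -> bool:
    """Heuristic: several leader-plus-page-number runs suggest a table of contents.

    min_leaders is an arbitrary threshold chosen for illustration.
    """
    return len(LEADER_RE.findall(passage)) >= min_leaders


def drop_toc_passages(passages):
    """Return the passages that do not look like a table of contents, keeping their order."""
    return [p for p in passages if not looks_like_toc(p)]
```

This only helps for manuals whose table of contents uses leader characters; a manual that lists headings and page numbers without them would slip through.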
These approaches are static and not applicable to multiple documents with varying structures (table of contents at the beginning, at the end, no table of contents, ...).
Does anyone have an idea for a better approach to this?