Ignore elements in cts:search

Question

I have some XML documents with a structure like this:

<root>
  <intro>...</intro>
  ...
  <body>
    <p>..................
       some text CO<sub>2</sub>
       .................. </p>
  </body>
</root>

Now I want to search for the phrase CO2 and also have documents of the above type appear in the search results. For this purpose, I am using this query:

cts:search(
  fn:collection("urn:iddn:collections:searchable"),
  cts:element-query(
    fn:QName("http://iddn.icis.com/ns/fields", "body"),
    cts:word-query(
      "CO2",
      ("case-insensitive", "diacritic-sensitive", "punctuation-insensitive",
       "whitespace-sensitive", "unstemmed", "unwildcarded", "lang=en"),
      1
    )
  ),
  ("unfiltered", "score-logtfidf"),
  0.0
)

But using this, I am not able to get documents with CO<sub>2</sub>. I am only getting documents with the plain phrase CO2.

If I change the search phrase to CO 2, then I get only documents with CO<sub>2</sub> and not those with CO2.

I want to get combined results for both CO2 and CO<sub>2</sub>.

So can I ignore <sub> by any means, or is there any other way to handle this problem?

score 5 · Accepted Answer · answered Nov 05 '15 at 15:13

The issue here is tokenization. "CO2" is a single word token. CO<sub>2</sub>, even with phrase-through, is a phrase of two word tokens: "CO" and "2". Just as "blackbird" does not match "black bird", so too does "CO2" not match "CO 2". The phrase-through setting just means that we're willing to look for a phrase that crosses the element boundary.
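The token mismatch can be illustrated with a toy model. The sketch below is plain Python, not MarkLogic's tokenizer: `tokenize`, `tokenize_nodes` and `phrase_matches` are hypothetical helpers, and the rule that a run of letters and digits forms one word token is a simplifying assumption. Each text node is tokenized on its own, which is why the element boundary inside CO<sub>2</sub> always produces two tokens.

```python
import re

def tokenize(text):
    """Toy tokenizer (assumption): each run of letters/digits is one word token."""
    return re.findall(r"[A-Za-z0-9]+", text)

def tokenize_nodes(text_nodes):
    """Tokenize each text node separately, then join the token lists.

    Joining the lists mimics phrase-through: a phrase may cross the
    element boundary, but tokens are never spliced together.
    """
    tokens = []
    for node in text_nodes:
        tokens.extend(tokenize(node))
    return tokens

def phrase_matches(query, doc_tokens):
    """True if the query's tokens occur consecutively in doc_tokens."""
    q = tokenize(query)
    n = len(q)
    return any(doc_tokens[i:i + n] == q for i in range(len(doc_tokens) - n + 1))

plain = tokenize_nodes(["some text CO2"])        # plain text: CO2
marked = tokenize_nodes(["some text CO", "2"])   # two text nodes: CO<sub>2</sub>

print(phrase_matches("CO2", plain), phrase_matches("CO2", marked))
print(phrase_matches("CO 2", plain), phrase_matches("CO 2", marked))
```

In this model the query "CO2" finds only the plain document and "CO 2" finds only the marked-up one, which is the behaviour described in the question.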

You can't splice together CO<sub>2</sub> into one token, but you might be able to use customized tokenization overrides to break "CO2" into two tokens. Define a field and define overrides for the digits as 'symbol'. This will make each digit its own token and will break "CO2" into two tokens in the context of that field. You'd then need to replace the word-query with a field-word-query.
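As a rough illustration of why the override helps, here is a toy Python model (not MarkLogic code). The `digits_as_symbols` flag is a hypothetical stand-in for the field's tokenizer overrides, and it follows the description above by making every digit its own token. How MarkLogic treats symbol tokens in other respects is not modelled here.

```python
import re

def tokenize(text, digits_as_symbols=False):
    """Toy tokenizer (assumption, not MarkLogic's implementation).

    Default: a run of letters/digits is one token.
    With the override: letter runs stay together, each digit stands alone.
    """
    pattern = r"[A-Za-z]+|[0-9]" if digits_as_symbols else r"[A-Za-z0-9]+"
    return re.findall(pattern, text)

def tokenize_nodes(text_nodes, digits_as_symbols=False):
    """Tokenize each text node separately and join the token lists."""
    tokens = []
    for node in text_nodes:
        tokens.extend(tokenize(node, digits_as_symbols))
    return tokens

def phrase_matches(query, doc_tokens, digits_as_symbols=False):
    """True if the query's tokens occur consecutively in doc_tokens."""
    q = tokenize(query, digits_as_symbols)
    n = len(q)
    return any(doc_tokens[i:i + n] == q for i in range(len(doc_tokens) - n + 1))

# Both documents as seen through the hypothetical field with the override.
plain_field = tokenize_nodes(["some text CO2"], digits_as_symbols=True)
marked_field = tokenize_nodes(["some text CO", "2"], digits_as_symbols=True)

print(plain_field == marked_field)
print(phrase_matches("CO2", plain_field, digits_as_symbols=True))
print(phrase_matches("CO2", marked_field, digits_as_symbols=True))
```

Because the query text is tokenized with the same override, a single query for "CO2" lines up with both forms in this model. That is the reason the word-query has to become a field-word-query: the query must be tokenized in the context of the field.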

You probably don't want this to apply anywhere in a document, so you'd be best off adding markup around these kinds of chemical phrases in your documents. Fields in general and tokenization overrides in particular will come at a performance cost. The contents of a field are indexed completely separately so the index is bigger, and the tokenization overrides mean that we have to retokenize as well, both on ingest and at query time. This will slow things down a little (not a lot).

I find it interesting that ML tokenizes on the <sub> when a phrase-through (to me) suggests just ignore the markup. — David Ennis -CleverLlamas.com, Nov 06 '15 at 17:44
Indexing is operating on a tree model, not on the character representation of the markup. So there's an intrinsic break because there is a whole different text node. I think there are cases (like this) where you want a "word-through" to splice tokens back together, but it isn't always the right thing to do. — mholstege, Nov 08 '15 at 01:44
Hi Mary - Perfect - so the items in a phrase-through element are still separate nodes. Makes sense. Thanks! — David Ennis -CleverLlamas.com, Nov 08 '15 at 14:31

score 2 · Answer 2 · answered Nov 05 '15 at 05:58

It appears that you want to add a phrase-through configuration.

Example:

<p>to <b>be</b> or not to be</p>

A phrase-through on <b> would then be indexed as "to be or not to be"

David Ennis -CleverLlamas.com


I already had a phrase-through created for `sub`. Even after that, it is not working as expected. – Ankit Bhardwaj Nov 05 '15 at 06:09
