I have some XML documents with a structure like this:

<root>
  <intro>...</intro>
   ...
  <body>
    <p>..................
       some text CO<sub>2</sub>
       .................. </p>
   </body>
</root>

I want to search for the phrase CO2 and have documents of the above type (where the 2 is wrapped in <sub>) included in the results. For this purpose, I am using this query:

cts:search(
  fn:collection("urn:iddn:collections:searchable"),
  cts:element-query(
    fn:QName("http://iddn.icis.com/ns/fields", "body"),
    cts:word-query(
      "CO2",
      ("case-insensitive", "diacritic-sensitive", "punctuation-insensitive",
       "whitespace-sensitive", "unstemmed", "unwildcarded", "lang=en"),
      1
    )
  ),
  ("unfiltered", "score-logtfidf"),
  0.0
)

With this query I do not get documents containing CO<sub>2</sub>. I only get documents containing the plain text CO2.

If I change the search phrase to "CO 2", I get only the documents with CO<sub>2</sub> and not those with CO2.

I want both CO<sub>2</sub> and CO2 to be returned in the same set of search results.

Is there any way to ignore <sub>, or some other way to solve this problem?

Ankit Bhardwaj

2 Answers

The issue here is tokenization. "CO2" is a single word token. CO<sub>2</sub>, even with phrase-through, is a phrase of two word tokens: "CO" and "2". Just as "blackbird" does not match "black bird", so too does "CO2" not match "CO 2". The phrase-through setting just means that we're willing to look for a phrase that crosses the <sub> element boundary.
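
Given that tokenization, one stopgap that needs no reconfiguration is to search for both forms at once. The following is only a sketch that combines the two queries from the question with cts:or-query. It assumes phrase-through is already set up for <sub> (the "CO 2" result reported in the question suggests it is), and it does not generalize to other formulas unless you generate both spellings for each term.

```xquery
xquery version "1.0-ml";

(: Sketch: match either the single token "CO2" or the two-token phrase "CO 2".
   Assumes phrase-through is already configured for <sub>. :)
let $options := ("case-insensitive", "diacritic-sensitive", "punctuation-insensitive",
                 "whitespace-sensitive", "unstemmed", "unwildcarded", "lang=en")
return
  cts:search(
    fn:collection("urn:iddn:collections:searchable"),
    cts:element-query(
      fn:QName("http://iddn.icis.com/ns/fields", "body"),
      cts:or-query((
        cts:word-query("CO2", $options),   (: plain text form :)
        cts:word-query("CO 2", $options)   (: form split by <sub> :)
      ))
    ),
    ("unfiltered", "score-logtfidf")
  )
```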

You can't splice together CO<sub>2</sub> into one token, but you might be able to use customized tokenization overrides to break "CO2" into two tokens. Define a field and define overrides for the digits as 'symbol'. This will make each digit its own token and will break "CO2" into two tokens in the context of that field. You'd then need to replace the word-query with a field-word-query.
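
On the query side, the change could look roughly like the sketch below. The field name "chem-text" is hypothetical. The sketch assumes that field has been configured to cover the relevant content, that its tokenizer overrides classify the digits 0-9 as 'symbol', and that phrase-through is set for <sub>.

```xquery
xquery version "1.0-ml";

(: "chem-text" is a hypothetical field name, not something from the question.
   Within this field "CO2" is assumed to tokenize as "CO" + "2", the same
   token sequence that CO<sub>2</sub> produces. :)
cts:search(
  fn:collection("urn:iddn:collections:searchable"),
  cts:field-word-query(
    "chem-text",
    "CO2",
    ("case-insensitive", "unstemmed", "lang=en")
  ),
  ("unfiltered", "score-logtfidf")
)
```

Because the query text is tokenized with the field's rules as well, a single search term should then cover both forms.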

You probably don't want this to apply everywhere in a document, so you'd be best off adding markup around these kinds of chemical phrases in your documents and restricting the field to that markup. Fields in general, and tokenization overrides in particular, come at a performance cost. The contents of a field are indexed completely separately, so the index is bigger, and the tokenization overrides mean that the text has to be retokenized, both on ingest and at query time. This will slow things down a little (not a lot).

mholstege
  • I find it interesting that ML tokenizes on the <sub> when a phrase-through (to me) suggests just ignoring the markup. – David Ennis -CleverLlamas.com Nov 06 '15 at 17:44
  • Indexing operates on a tree model, not on the character representation of the markup. So there's an intrinsic break, because the content of the element is a whole different text node. I think there are cases (like this) where you want a "word-through" to splice tokens back together, but it isn't always the right thing to do. – mholstege Nov 08 '15 at 01:44
  • Hi Mary - Perfect - so the items in a phrase-through element are still separate nodes. Makes sense. Thanks! – David Ennis -CleverLlamas.com Nov 08 '15 at 14:31

It appears that you want to add a phrase-through configuration.

Example:

<p>to <b>be</b> or not to be</p> 

With a phrase-through on <b>, this would be indexed as the phrase "to be or not to be".
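
For the asker's documents, a phrase-through on <sub> could be added with the Admin API, roughly as sketched here. The database name "Documents" is a placeholder, and the namespace passed for sub is an assumption: it should be whatever namespace the sub elements are actually in. Check the function signatures against the Admin API documentation for your MarkLogic version before relying on this.

```xquery
xquery version "1.0-ml";

import module namespace admin = "http://marklogic.com/xdmp/admin"
  at "/MarkLogic/admin.xqy";

(: "Documents" is a placeholder database name.
   The namespace for sub is assumed to match the one used for body. :)
let $config := admin:get-configuration()
let $dbid   := xdmp:database("Documents")
let $pthru  := admin:database-phrase-through("http://iddn.icis.com/ns/fields", "sub")
let $config := admin:database-add-phrase-through($config, $dbid, $pthru)
return admin:save-configuration($config)
```

Content loaded before the change will need to be reindexed for the setting to take effect. As the other answer points out, this lets the phrase "CO 2" match across the <sub> boundary, but it does not merge "CO" and "2" into the single token "CO2".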