CFSearch + Solr: Ignoring HTML in Searches

Question

I have a problem and Google hasn't helped me much. I'm trying to figure out a way to ignore HTML while searching a Solr index in ColdFusion (9).

For example, if I search for microsoft and my index contains Microsoft© makes Windows® I'm prompted to search for "Microsoft© makes Windows®" rather than showing the actual result.

As you can see below, I'm just passing the string into the criteria property of cfsearch - but again - doing this produces (what I consider to be) a "dirty" result.

  <cfsearch
      collection="mycollection"
      criteria="microsoft"
      name="results"
      maxrows="100"
      suggestions="always"
      contextHighlightBegin="<strong>"
      contextHighlightEnd="</strong>"
      contextPassages="3"
      />

I've been looking at the documentation for Solr's query syntax but I don't see anything that jumps out at me on how to avoid this problem.

Should I look at providing the index a "flat" version of text or is there a way to avoid HTML strings such as © / ® / ™?

I'm open to suggestions.

-- Brian.

I'm using CF10 which should be using Solr 3.4 according to http://www.corporatezen.com/2013/11/updating-solr-engine-coldfusion/. I added `` to `` but the search result still returns HTML. Any idea why? — Henry, Feb 24 '15 at 02:21

score 3 · Answer 1 · edited May 23 '17 at 11:56

Check whether the Solr field you're searching is defined with the String field type rather than Text. A String field is matched as a single literal value, while a Text field is tokenized and run through text analysis. See this question for more information about this.

If the problem really is one of stripping HTML, you'll have to add HTMLStripCharFilterFactory to your field type's analyzer configuration; it strips HTML markup from the text before it is tokenized.

answered Feb 26 '12 at 03:01

Mauricio Scheffer

Thanks Mauricio. From what you posted, it looks like there is no (relatively) easy way of doing this with ColdFusion + Solr so I think I'm just going to strip out HTML before indexing. – NotJustClarkKent Feb 27 '12 at 16:53
The only issue with HTMLStripCharFilterFactory or any other charFilter is that these were introduced in Solr 1.4, and CF 9 runs Solr 1.3 by default. Even CF 9.0.1 appears to run a pre-1.4 Solr release. That's not to say that one can't upgrade because one certainly can, only that won't be what is running on CF 9 out of the box. – David Faber Mar 01 '12 at 14:31
@DavidFaber : wow, Solr 1.3 is now ~3.5 years old. That's *a lot* in Solr years :) – Mauricio Scheffer Mar 01 '12 at 14:59

NotJustClarkKent · Accepted Answer · 2012-02-27T18:23:18.447

For anyone that might be faced with the same question:

The solution for this question was to use an alternate method of indexing rather than trying to work around the HTML within the index.

Within the database I created a new field called index_search and on my insert method within my application I used a regex to omit any special(er) characters: "[^[:word:].[:space:]-]"
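
The stripping step described above might look roughly like the following. This is only a sketch, not the exact code from my application: the variable names nameHtml and nameSearch are hypothetical, and the sample string is built with Chr() so the example does not depend on the template's page encoding.

```cfml
<!--- Hypothetical sample value: "Microsoft(c) makes Windows(r)" with the real symbols.
      Chr(169) is the copyright sign, Chr(174) the registered sign. --->
<cfset nameHtml = "Microsoft" & Chr(169) & " makes Windows" & Chr(174)>

<!--- Remove every character that is not a word character, a period,
      whitespace, or a hyphen. The "all" scope replaces every match,
      not just the first one. --->
<cfset nameSearch = REReplace(nameHtml, "[^[:word:].[:space:]-]", "", "all")>

<!--- nameSearch is the value that would be saved to the search-only column. --->
<cfoutput>#nameSearch#</cfoutput>
```

With literal symbols in the input, this should leave just the plain words for the index. One caveat: if the stored text contains entity references such as &copy; instead of the characters themselves, this pattern removes only the & and ; and leaves the entity name (copy) attached to the preceding word, so that kind of input would likely need an extra pass to drop the entities first.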

From there, I passed the index_search field to the body of cfindex and used the HTML name as the title:

  <cfindex
    collection="mycollection"
    action="update"
    body="name_search,html_description"
    title="name_html"
    key="UUID"
    query="data">

Using this method produced the expected output when searching for words or phrases close to, or wrapped in, HTML. For example, searching for microsoft would list all results with Microsoft© in them.
