
I am using logstash with the JDBC driver to bulk import a bunch of data from SQL Server to Elasticsearch. (The end goal is to have this data be searchable from a web front-end.)

One of the table columns contains HTML tags (`<span id='blah'>`, `<p class='foo'>`, etc.). I want the content to be searchable, but the tags to be ignored. That is, if someone searches for the word "foo", a document that only contains `<p class='foo'>` should NOT come up. On the other hand, I DO want the full content, including markup, to be stored in Elasticsearch.

Is there something I can do in my logstash .config file to make Elasticsearch "aware" that this is HTML content?

anon
  • Perhaps the question can be summarized as "make a field unsearchable"? The field to make unsearchable being the one containing the tags. – baudsp May 05 '17 at 15:51
  • To make a field unsearchable, I think it is possible to use the `index: false` option of a mapping, on the field with the tags. – baudsp May 05 '17 at 16:00
  • @baudsp - To clarify, I want the tags to be ignored, but the "real" content SHOULD be searchable. I will edit my question. – anon May 05 '17 at 16:06
  • I've never done it before, but I think what you are looking for is the `html_strip` filter/analyzer. It would be applied on the ES side, not Logstash. – Alcanzar May 05 '17 at 22:15
  • From what I understood, all the tags (e.g. `<p class='foo'>`) are in just one column, with the real content in another column. With the jdbc plugin, each column from the table is a field in a document, so that's why I said to make that field not indexed. – baudsp May 07 '17 at 15:05

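For reference, the `html_strip` approach suggested in the comments can be sketched as index settings defined on the Elasticsearch side (rather than in the Logstash config). The index name `articles`, the analyzer name `html_content`, and the field name `content` are placeholders; the syntax shown is the typeless-mapping form used by recent Elasticsearch versions (7.x+), so older versions would need a mapping type level:

```json
PUT /articles
{
  "settings": {
    "analysis": {
      "analyzer": {
        "html_content": {
          "type": "custom",
          "char_filter": ["html_strip"],
          "tokenizer": "standard",
          "filter": ["lowercase"]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "content": {
        "type": "text",
        "analyzer": "html_content"
      }
    }
  }
}
```

With a mapping like this, the full markup is still kept verbatim in `_source` (so it comes back on retrieval), while the `html_strip` character filter removes tags and attributes before tokenization, so a search for "foo" should not match a document whose only occurrence of "foo" is inside `class='foo'`. The analyzer's behavior can be checked with the `_analyze` API before bulk importing.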
0 Answers