I have an eXist-db database with a couple of (large) TEI XML files that I want to index and search with Sphinx. For indexing, I have an xmlpipe2 command that calls a sphinx-out.xql URL served by eXist-db. Along with the actual text snippets (paragraphs, headings, notes etc.), this provides a couple of attributes that I later want to use when presenting search results. One of them is a crumbtrail field that contains HTML (more precisely, a series of <a> hyperlinks).

As I want to offer sentence and paragraph operators in searching, I have set index_sp = 1, and since this in turn requires HTML stripping, I also have html_strip = 1. But this seems to strip the HTML from my attributes as well, and I want to retain it there.

Here is what sphinx-out.xql, and hence the xmlpipe2 command, returns:

<sphinx:docset>
<sphinx:document id="77">
  <sphinx_docid>77</sphinx_docid>
  <sphinx_work>W0013</sphinx_work>
  <sphinx_author>Vitoria, Francisco de</sphinx_author>
  <sphinx_title>Relectiones</sphinx_title>
  <sphinx_year>1557</sphinx_year>
  <sphinx_crumbtrail>
    <span class="crumbtrail">
      <a href="/exist/apps/salamanca/work.html?wid=W0013#Vol02">Vol. 2</a>
      <span class="tokenizer"> &gt; </span>
      <a href="/exist/apps/salamanca/work.html?wid=W0013#Vol02Lect01">De augmento charitatis</a>
    </span>
  </sphinx_crumbtrail>
  <sphinx_description>
    <p xmlns="http://www.tei-c.org/ns/1.0" xml:id="p_l3w_pml_y4">
      [SNIP]
    </p>
  </sphinx_description>
</sphinx:document>
 .
 .
 .
</sphinx:docset>

And here is what a mysql query to sphinx gives:

mysql> select sphinx_docid, sphinx_work, sphinx_crumbtrail from salamanca_base;
+------+--------+--------------+-------------+---------------------------------+
| id   | weight | sphinx_docid | sphinx_work | sphinx_crumbtrail               |
+------+--------+--------------+-------------+---------------------------------+
  .
  .
  .
|   77 |      1 |           77 | W0013       | Vol. 2 > De augmento charitatis |
+------+--------+--------------+-------------+---------------------------------+
20 rows in set (0.00 sec)

Now I wonder if there is any way for me to disable html stripping for attributes?

Can anyone at least confirm that it is possible to store html in sphinx attributes?

Thanks for any insight

awagner
  • Are you sure you need index_sp? It's NOT needed for **phrase** search. – barryhunter Nov 04 '14 at 15:00
  • I know eXist-db very well, but this question seems to be specifically about Sphinx, or have I missed something? – adamretter Nov 05 '14 at 01:51
  • @adamretter, yes I think getting sphinx to do what I want is the more obvious approach. I just wanted to put everything on the table; perhaps I should be approaching the issue from a different angle. E.g. one of the approaches I tried was base64-encoding the crumbtrail html so that sphinx does not even look inside. But it turned out (or so I think) that eXist's util:base64-encode also encodes only the actual text/string content, stripping the html no matter what, so I was no further along. – awagner Nov 05 '14 at 08:30
  • @barryhunter, I'm sorry, I was confused. I meant I wanted to offer Sentence and paragraph operators. Tbh, I don't (yet) know if I am going to really *need* it, but it sure sounds nice. Will keep you updated on that as well. – awagner Nov 05 '14 at 08:33
  • I tried with ``html_index_attrs = span=class;a=href`` and even with ``index_sp = 0`` and ``html_strip = 0``, but still did not get html in attributes. Am I doing something wrong? Did anyone ever successfully use html in a sphinx attribute? – awagner Nov 05 '14 at 09:59
  • are you completely reindexing when changing settings? – barryhunter Nov 05 '14 at 11:56
  • ``time sudo -u sphinxsearch indexer --all --rotate`` should do that, shouldn't it? – awagner Nov 05 '14 at 14:38

1 Answer

Maybe use html_index_attrs so that spans and a's are not removed?

html_index_attrs = span=class; a=href
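
One thing that may be worth checking, offered as an assumption rather than a confirmed diagnosis: in the xmlpipe2 output shown in the question, the crumbtrail markup is embedded as real child elements of <sphinx_crumbtrail>. An XML parser reports those as nested elements, so a consumer that only collects the character data of the attribute element would end up with "Vol. 2 > De augmento charitatis" regardless of html_strip. If the markup is escaped (or wrapped in CDATA) before it is emitted, it arrives as plain character data instead. The Python sketch below only illustrates that difference with a generic XML parser; it does not run Sphinx, and the variable names are made up for the illustration.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

# Shortened crumbtrail markup, taken from the question's sample document.
crumbtrail_html = (
    '<span class="crumbtrail">'
    '<a href="/exist/apps/salamanca/work.html?wid=W0013#Vol02">Vol. 2</a>'
    '</span>'
)

# Variant 1: markup embedded as child elements, as in the question's output.
nested = ET.fromstring(
    "<sphinx_crumbtrail>" + crumbtrail_html + "</sphinx_crumbtrail>"
)

# Variant 2: the same markup escaped, so it is character data of the element.
escaped = ET.fromstring(
    "<sphinx_crumbtrail>" + escape(crumbtrail_html) + "</sphinx_crumbtrail>"
)

# Collecting only character data drops the tags in variant 1 ...
nested_text = "".join(nested.itertext())
# ... while in variant 2 the parser hands back the original markup as text.
escaped_text = escaped.text

print(nested_text)
print(escaped_text)
```

If this is indeed what happens, the fix would belong on the eXist side: serialize the crumbtrail to a string and emit it escaped, rather than changing the Sphinx stripping settings.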

barryhunter
  • This does not help. Then again, nothing does, not even disabling ``html_strip``, so I started wondering if I am doing something wrong... – awagner Nov 05 '14 at 10:00