
Using input file apertest.html:

<h4><a href="/" rel="nofollow" title="Vodafone anuncia su plan para actualizar la red de cable de Ono a la tecnología DOCSIS 3.1 para poder ofrecer conexiones simétricas de 1 Gbps"><span class="title">Vodafone actualizará la red de Ono para poder ofrecer 1 Gbps simétrico</span> <span class="reach">144</span> <span class="date">2016</span> </a></h4>

Running cat apertest.html | apertium -f html -u es-en gives this output:

<h4><a href="/" rel="nofollow" title="Vodafone Announces his plan to update the network of wire of Ono to the technology DOCSIS 3.1 to be able to offer symmetrical connections of 1 Gbps"><span class="title">Vodafone Will</span> update <span class="title">the network of Ono to be able to offer 1 Gbps symmetrical</span> <span class="reach">144</span> <span class="date">2016</span></a></h4>

I was expecting:

<h4><a href="/" rel="nofollow" title="Vodafone Announces his plan to update the network of wire of Ono to the technology DOCSIS 3.1 to be able to offer symmetrical connections of 1 Gbps"><span class="title">Vodafone Will update the network of Ono to be able to offer 1 Gbps symmetrical</span> <span class="reach">144</span> <span class="date">2016</span></a></h4>

Why is it splitting the sentence inside the title span into three parts?

Meisser

1 Answer


I'm fairly sure this is because spans are treated as word-bound (inline) tags, like <em> or <b>, and not as block-level tags, like <div>. If a tag is word-bound, Apertium is free to delete or copy it, which is what happened around "update" in your output. The block-level structure, on the other hand, is always preserved.

If certain classes of spans are used as if they were block-level tags, you have two options. You could preprocess the input, turning every <span class="title"> into a div. Or you could check whether https://github.com/TinoDidriksen/Transfuse/ (the underlying format-handling library) allows more nuanced handling of spans; if this kind of thing happens a lot, it might make sense to add a feature to Transfuse that lets you mark certain spans as actually being divs. Preprocessing seems like the easiest way out, though.
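The preprocessing route could look something like the sketch below. It is only an illustration, not part of Apertium or Transfuse: the function name spans_to_divs and the block_classes parameter are made up for this example, and it assumes the markup is simple enough for a regular expression (no ">" inside attribute values, class attributes in double quotes). It keeps a stack so that each </span> is rewritten only if its matching opening tag was.

```python
import re

# Matches any opening or closing <span> tag. Group 1 is "/" for a closing
# tag; group 2 is the raw attribute text of an opening tag.
SPAN_TAG = re.compile(r"<(/?)span\b([^>]*)>", re.IGNORECASE)
CLASS_ATTR = re.compile(r'\bclass\s*=\s*"([^"]*)"', re.IGNORECASE)


def spans_to_divs(html, block_classes=("title",)):
    """Rewrite <span> tags carrying one of block_classes as <div> tags.

    Hypothetical helper: assumes well-nested spans and no ">" characters
    inside attribute values.
    """
    stack = []  # one bool per open span: was it rewritten to a div?

    def replace(match):
        closing, attrs = match.group(1), match.group(2)
        if closing:
            # Close with the tag name the matching opener ended up with.
            rewritten = stack.pop() if stack else False
            return "</div>" if rewritten else "</span>"
        class_match = CLASS_ATTR.search(attrs)
        classes = class_match.group(1).split() if class_match else []
        rewritten = any(c in block_classes for c in classes)
        stack.append(rewritten)
        return f"<div{attrs}>" if rewritten else match.group(0)

    return SPAN_TAG.sub(replace, html)


if __name__ == "__main__":
    with open("apertest.html", encoding="utf-8") as handle:
        print(spans_to_divs(handle.read()))
```

The printed result can then be piped into the same apertium -f html -u es-en command as before. If the page's styling depends on the elements being spans, the divs would need to be turned back after translation, which this sketch does not do.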

unhammer