
I am writing a plugin for a web application that takes user-provided HTML and transforms it into different HTML. Mostly I want to find all elements with a given class or content ("directives") and rewrite them to something else. I am using Scala 2.11.1 and the TagSoup parser to deal with XML-unfriendly markup.

My main problem at the moment is that a call to XML.loadString("<div></div>") yields a minimized element:

scala> XML.loadString("<div></div>")
res2: scala.xml.Elem = <div/>

This behaviour garbles the resulting page (iframes, divs, etc.), because I want to leave these tags unminimized. Is there a way to avoid this behaviour in the loading phase?
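
One way to sidestep this without touching the loading phase is to decide the tag form at serialization time. The following is a minimal sketch, assuming the scala-xml module that ships with Scala 2.11, where `Utility.serialize` accepts a `minimizeTags` argument. The names `NeverMinimize` and `render` are made up for illustration.

```scala
import scala.xml.{MinimizeMode, Node, Utility}

object NeverMinimize {
  // Serialize the tree so that every empty element is written as <tag></tag>,
  // regardless of the minimizeEmpty flag stored on each Elem.
  def render(node: Node): String =
    Utility.serialize(node, minimizeTags = MinimizeMode.Never).toString
}
```

With this, `NeverMinimize.render(scala.xml.XML.loadString("<div></div>"))` gives `<div></div>`. The catch is that it applies to every element, so `<br/>` becomes `<br></br>`, which browsers may not treat the same way in HTML.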

The second problem is related to TagSoup. When parsing a block of code like:

<script type="javascript">console.log("Hello");</script>

TagSoup parses it as

<script type="javascript">console.log(&quot;Hello&quot;);</script>

Is there anything that can be done to avoid these problems? So far I have only come up with "nasty" solutions, such as rewriting all elements to be unminimized and removing all entities from the content of <script> tags.
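
For the `<script>` case, one option is to swap the text inside script elements for `scala.xml.Unparsed` nodes, which are written out verbatim. This is only a sketch and rests on an assumption: that the `&quot;` entities appear when scala-xml serializes the tree (it escapes text nodes), not inside TagSoup itself. `RawScripts` and `rawScripts` are illustrative names.

```scala
import scala.xml.{Elem, Node, Unparsed}

object RawScripts {
  // Walk the tree; inside <script>, replace the children with a single
  // Unparsed node so the text is emitted without entity escaping.
  def rawScripts(node: Node): Node = node match {
    case e: Elem if e.label == "script" =>
      e.copy(minimizeEmpty = false, child = Seq(Unparsed(e.text)))
    case e: Elem =>
      e.copy(child = e.child.map(rawScripts))
    case other =>
      other
  }
}
```

The output is then no longer guaranteed to be well-formed XML, since the script body is copied through as-is, so this probably only makes sense as the last step before the page is written out.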


The TagSoup parsing is done like this:

import java.net.URL

import org.ccil.cowan.tagsoup.jaxp.SAXFactoryImpl
import org.xml.sax.InputSource

import scala.xml._
import scala.xml.parsing.NoBindingFactoryAdapter

object HTML {
  // SAX handler that builds scala.xml nodes from parser events
  lazy val adapter = new NoBindingFactoryAdapter
  // SAX parser obtained from TagSoup's JAXP factory
  lazy val parser  = (new SAXFactoryImpl).newSAXParser()

  def load(source: InputSource) = adapter.loadXML(source, parser)
  // Source here is scala.xml.Source, which wraps the string in an InputSource
  def loadString(source: String) = load(Source.fromString(source))
  def loadURL(url: URL) = load(new InputSource(url.openConnection().getInputStream))
}
Karel Horak
  • Try [htmlparser](http://about.validator.nu/htmlparser/), see [Scala HTML parser object usage](http://stackoverflow.com/a/11424036/651140) – Andrzej Jozwik Sep 04 '14 at 07:36
  • Thank you for your suggestion! I have tried this parser, but seemingly it replaces all html special chars by HTML entities too. I will go through the settings to see if this behaviour could be turned off. – Karel Horak Sep 04 '14 at 08:04
  • The minimization of the empty tag is likely caused by `NoBindingFactoryAdapter`'s methods `create` and `createNode`. Both of them pass `children.isEmpty` as the value of `minimizeEmpty`. – Karel Horak Sep 04 '14 at 08:42
  • Alternatively you could use ANTLR4, but in my opinion that would be overkill. – petrbel Sep 07 '14 at 19:42
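
Following up on the `minimizeEmpty` comment above: rather than subclassing the adapter, the flag can also be reset after loading with `Elem.copy`. A rough sketch, assuming scala-xml as bundled with Scala 2.11. `Unminimize`, `voidTags` and `expandEmpty` are hypothetical names, and the tag set is a partial list of HTML void elements chosen for illustration.

```scala
import scala.xml.{Elem, Node}

object Unminimize {
  // Elements that are allowed to stay self-closed (partial, illustrative list).
  val voidTags = Set("br", "hr", "img", "input", "link", "meta")

  // Rebuild the tree so that only void elements keep minimizeEmpty = true;
  // every other empty element will be written as <tag></tag>.
  def expandEmpty(node: Node): Node = node match {
    case e: Elem =>
      e.copy(
        minimizeEmpty = voidTags.contains(e.label),
        child = e.child.map(expandEmpty))
    case other =>
      other
  }
}
```

Calling `toString` on the result uses the default serialization mode, which honours the per-element flag, so `<iframe>` and `<div>` stay expanded while `<br/>` stays short. This is still the "rewrite all elements" approach, just confined to one small function.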
