I'm trying to write an XSLT stylesheet that takes an HTML page and transforms it so that it contains only the contents of the div tag with id "content". I'm using Apache ServiceMix to develop a service unit that performs this action, but I'm completely lost!

So far I have created a unit that (at least I think it does this) takes a file, applies the transformation, and saves the result to an output folder:

<?xml version="1.0" encoding="UTF-8"?>
<blueprint
    xmlns="http://www.osgi.org/xmlns/blueprint/v1.0.0"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="

http://www.osgi.org/xmlns/blueprint/v1.0.0

      http://www.osgi.org/xmlns/blueprint/v1.0.0/blueprint.xsd">

    <camelContext xmlns="http://camel.apache.org/schema/blueprint">
      <route>
        <from uri="file:camel/input"/>
        <log message="Moving ${file:name} to the output directory"/>
        <to uri="xslt:file:///transform.xsl"/>
        <to uri="file:camel/output"/>
      </route>
    </camelContext>

</blueprint>

and a transformation .xsl file:

<xsl:stylesheet version="1.0"
 xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
 xmlns:xhtml="http://www.w3.org/1999/xhtml"
 xmlns="http://www.w3.org/1999/xhtml"
 exclude-result-prefixes="xhtml">
    <xsl:template match="/">
        <html>
            <head><title>HTML Transformation</title></head>
            <body>
                <xsl:copy-of select="//xhtml:DIV[@id='content']"/>
            </body>
        </html>
    </xsl:template>
</xsl:stylesheet>

but it keeps throwing this error:

21:23:50,395 | INFO  | le://camel/input | route3                           | 91 - org.apache.camel.camel-core - 2.8.5 | Moving INPUTFILE.html to the output directory
21:23:50,850 | ERROR | le://camel/input | DefaultErrorHandler              | 91 - org.apache.camel.camel-core - 2.8.5 | Failed delivery for exchangeId: ID-servicemix-48257-1358413760241-2-2137. Exhausted after delivery attempt: 1 caught: javax.xml.transform.TransformerException: java.io.IOException: Server returned HTTP response code: 403 for URL: http://www.w3c.org/TR/1999/REC-html401-19991224/loose.dtd
javax.xml.transform.TransformerException: java.io.IOException: Server returned HTTP response code: 403 for URL: http://www.w3c.org/TR/1999/REC-html401-19991224/loose.dtd
    at org.apache.xalan.transformer.TransformerImpl.fatalError(TransformerImpl.java:782)[:]
    at org.apache.xalan.transformer.TransformerImpl.transform(TransformerImpl.java:756)[:]
    at org.apache.xalan.transformer.TransformerImpl.transform(TransformerImpl.java:1273)[:]
    at org.apache.xalan.transformer.TransformerImpl.transform(TransformerImpl.java:1251)[:]
    at org.apache.camel.builder.xml.XsltBuilder.process(XsltBuilder.java:123)[91:org.apache.camel.camel-core:2.8.5]
    at org.apache.camel.impl.ProcessorEndpoint.onExchange(ProcessorEndpoint.java:102)[91:org.apache.camel.camel-core:2.8.5]
    at org.apache.camel.impl.ProcessorEndpoint$1.process(ProcessorEndpoint.java:72)[91:org.apache.camel.camel-core:2.8.5]
    at org.apache.camel.impl.converter.AsyncProcessorTypeConverter$ProcessorToAsyncProcessorBridge.process(AsyncProcessorTypeConverter.java:48)[91:org.apache.camel.camel-core:2.8.5]
    at org.apache.camel.util.AsyncProcessorHelper.process(AsyncProcessorHelper.java:78)[91:org.apache.camel.camel-core:2.8.5]
    at org.apache.camel.processor.SendProcessor$2.doInAsyncProducer(SendProcessor.java:114)[91:org.apache.camel.camel-core:2.8.5]
    at org.apache.camel.impl.ProducerCache.doInAsyncProducer(ProducerCache.java:284)[91:org.apache.camel.camel-core:2.8.5]
    at org.apache.camel.processor.SendProcessor.process(SendProcessor.java:109)[91:org.apache.camel.camel-core:2.8.5]
    at org.apache.camel.util.AsyncProcessorHelper.process(AsyncProcessorHelper.java:78)[91:org.apache.camel.camel-core:2.8.5]
    at org.apache.camel.processor.DelegateAsyncProcessor.processNext(DelegateAsyncProcessor.java:98)[91:org.apache.camel.camel-core:2.8.5]
    at org.apache.camel.processor.DelegateAsyncProcessor.process(DelegateAsyncProcessor.java:89)[91:org.apache.camel.camel-core:2.8.5]
    at org.apache.camel.management.InstrumentationProcessor.process(InstrumentationProcessor.java:69)[91:org.apache.camel.camel-core:2.8.5]
    at org.apache.camel.util.AsyncProcessorHelper.process(AsyncProcessorHelper.java:78)[91:org.apache.camel.camel-core:2.8.5]
    at org.apache.camel.processor.DelegateAsyncProcessor.processNext(DelegateAsyncProcessor.java:98)[91:org.apache.camel.camel-core:2.8.5]
    at org.apache.camel.processor.DelegateAsyncProcessor.process(DelegateAsyncProcessor.java:89)[91:org.apache.camel.camel-core:2.8.5]
    at org.apache.camel.processor.interceptor.TraceInterceptor.process(TraceInterceptor.java:90)[91:org.apache.camel.camel-core:2.8.5]
    at org.apache.camel.util.AsyncProcessorHelper.process(AsyncProcessorHelper.java:78)[91:org.apache.camel.camel-core:2.8.5]
    at org.apache.camel.processor.RedeliveryErrorHandler.processErrorHandler(RedeliveryErrorHandler.java:318)[91:org.apache.camel.camel-core:2.8.5]
    at org.apache.camel.processor.RedeliveryErrorHandler.process(RedeliveryErrorHandler.java:209)[91:org.apache.camel.camel-core:2.8.5]
    at org.apache.camel.processor.DefaultChannel.process(DefaultChannel.java:306)[91:org.apache.camel.camel-core:2.8.5]
    at org.apache.camel.util.AsyncProcessorHelper.process(AsyncProcessorHelper.java:78)[91:org.apache.camel.camel-core:2.8.5]
    at org.apache.camel.processor.Pipeline.process(Pipeline.java:116)[91:org.apache.camel.camel-core:2.8.5]
    at org.apache.camel.processor.Pipeline.process(Pipeline.java:79)[91:org.apache.camel.camel-core:2.8.5]
    at org.apache.camel.processor.UnitOfWorkProcessor.processAsync(UnitOfWorkProcessor.java:139)[91:org.apache.camel.camel-core:2.8.5]
    at org.apache.camel.processor.UnitOfWorkProcessor.process(UnitOfWorkProcessor.java:106)[91:org.apache.camel.camel-core:2.8.5]
    at org.apache.camel.util.AsyncProcessorHelper.process(AsyncProcessorHelper.java:78)[91:org.apache.camel.camel-core:2.8.5]
    at org.apache.camel.processor.DelegateAsyncProcessor.processNext(DelegateAsyncProcessor.java:98)[91:org.apache.camel.camel-core:2.8.5]
    at org.apache.camel.processor.DelegateAsyncProcessor.process(DelegateAsyncProcessor.java:89)[91:org.apache.camel.camel-core:2.8.5]
    at org.apache.camel.management.InstrumentationProcessor.process(InstrumentationProcessor.java:69)[91:org.apache.camel.camel-core:2.8.5]
    at org.apache.camel.component.file.GenericFileConsumer.processExchange(GenericFileConsumer.java:353)[91:org.apache.camel.camel-core:2.8.5]
    at org.apache.camel.component.file.GenericFileConsumer.processBatch(GenericFileConsumer.java:176)[91:org.apache.camel.camel-core:2.8.5]
    at org.apache.camel.component.file.GenericFileConsumer.poll(GenericFileConsumer.java:137)[91:org.apache.camel.camel-core:2.8.5]
    at org.apache.camel.impl.ScheduledPollConsumer.doRun(ScheduledPollConsumer.java:139)[91:org.apache.camel.camel-core:2.8.5]
    at org.apache.camel.impl.ScheduledPollConsumer.run(ScheduledPollConsumer.java:91)[91:org.apache.camel.camel-core:2.8.5]
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:439)[:1.6.0_38]
    at java.util.concurrent.FutureTask$Sync.innerRunAndReset(FutureTask.java:317)[:1.6.0_38]
    at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:150)[:1.6.0_38]
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$101(ScheduledThreadPoolExecutor.java:98)[:1.6.0_38]
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.runPeriodic(ScheduledThreadPoolExecutor.java:180)[:1.6.0_38]
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:204)[:1.6.0_38]
    at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)[:1.6.0_38]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)[:1.6.0_38]
    at java.lang.Thread.run(Thread.java:662)[:1.6.0_38]
Caused by: java.io.IOException: Server returned HTTP response code: 403 for URL: http://www.w3c.org/TR/1999/REC-html401-19991224/loose.dtd
    at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1459)[:1.6.0_38]
    at org.apache.xerces.impl.XMLEntityManager.setupCurrentEntity(Unknown Source)[:]
    at org.apache.xerces.impl.XMLEntityManager.startEntity(Unknown Source)[:]
    at org.apache.xerces.impl.XMLEntityManager.startDTDEntity(Unknown Source)[:]
    at org.apache.xerces.impl.XMLDTDScannerImpl.setInputSource(Unknown Source)[:]
    at org.apache.xerces.impl.XMLDocumentScannerImpl$DTDDispatcher.dispatch(Unknown Source)[:]
    at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanDocument(Unknown Source)[:]
    at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)[:]
    at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)[:]
    at org.apache.xerces.parsers.XMLParser.parse(Unknown Source)[:]
    at org.apache.xerces.parsers.AbstractSAXParser.parse(Unknown Source)[:]
    at org.apache.xml.dtm.ref.DTMManagerDefault.getDTM(DTMManagerDefault.java:439)[:]
    at org.apache.xalan.transformer.TransformerImpl.transform(TransformerImpl.java:699)[:]
    ... 45 more

and I have no idea what it means :( Can anyone help?

Thanks heaps

рüффп
Pete
  • What's the URL of the page that you're loading into your parser to run the XSLT on it? What's the code you're using to run the XSLT? It looks like the action is failing because your parser can't find the DTD file for the page you want to load, and the DTD does indeed seem inaccessible. – JLRishe Jan 17 '13 at 01:55
  • Ah excellent, thanks heaps @JLRishe ! The !DOCTYPE tag in the webpage that I was accessing was a mess, so I just replaced it with ` ` and it worked! (Although I'm still getting errors but I'm pretty sure they're not related) – Pete Jan 17 '13 at 03:32
  • Great. Now that I know that was the issue, I've added a more thorough response in the answers section. :-) – JLRishe Jan 17 '13 at 03:54

1 Answer

As discussed in the comments, the issue here is that the DOCTYPE declaration in the HTML page you are accessing references a DTD file that cannot be retrieved (the server answers with HTTP 403), and the parser fails when it tries to load that file.

If the HTML is something you can modify, a possible solution would be to change the DOCTYPE so that it points at a DTD that can be retrieved (it looks like the HTML4 loose DTD is accessible at this URL: http://www.w3.org/TR/html4/loose.dtd). The parser you're using may also have an option to ignore the DTD, though that may not be the best option, because the documents could be using HTML-only entities like &nbsp;, which are defined in the DTD.
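
As a rough illustration of the "ignore the DTD" option, here is a minimal Java sketch using the standard JAXP API. It installs an EntityResolver that answers every external entity request with an empty stream, so the parser never tries to download the DTD. The class name SkipDtdExample and the method parseWithoutDtd are made up for this example, and the sample page is a stand-in for your input file. It does not show how to hook this into the Camel route, and it assumes the page is otherwise well-formed XML.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.EntityResolver;
import org.xml.sax.InputSource;

// Hypothetical helper class, not part of Camel or ServiceMix.
public class SkipDtdExample {

    /** Parses XML text without fetching the external DTD named in its DOCTYPE. */
    public static Document parseWithoutDtd(String xml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);
        DocumentBuilder builder = factory.newDocumentBuilder();

        // Answer every external entity lookup (including the DTD) with an
        // empty stream, so no network connection is opened.
        builder.setEntityResolver(new EntityResolver() {
            public InputSource resolveEntity(String publicId, String systemId) {
                return new InputSource(new StringReader(""));
            }
        });

        return builder.parse(new InputSource(new StringReader(xml)));
    }

    public static void main(String[] args) throws Exception {
        // Stand-in input that uses the same DTD URL as in the error message.
        String page =
            "<!DOCTYPE html PUBLIC \"-//W3C//DTD HTML 4.01 Transitional//EN\" "
            + "\"http://www.w3c.org/TR/1999/REC-html401-19991224/loose.dtd\">"
            + "<html><body><div id=\"content\">Hello</div></body></html>";

        Document doc = parseWithoutDtd(page);
        System.out.println(doc.getDocumentElement().getNodeName());
    }
}
```

The trade-off is the one mentioned above: with the DTD replaced by an empty stream, entities such as &nbsp; are no longer declared, so a page that uses them will fail to parse with an undeclared-entity error.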

JLRishe