1

I have a situation where the underlying application provides a UI layer and this in turn has to be rendered as a portlet. However, I do not want all parts of the UI originally presented to be rendered in Portlet.

Proposed solution: Using Datapower for parsing an XML being a norm, I am wondering if it is possible to parse a HTML. I understand HTML may not be always well formed. But if there are very few HTML pages in underlying application, then a contract can be enforced..

Also, if one manages to parse and extract data out of HTML using DP, then the resultant (perhaps and XML) can be used to produce HTML5 with all its goodies.

So question: Is it advisable to use Datapower to parse an HTML page to extract an XML out of it? Prerequisite: number of HTML pages per application could vary in data but not with many pages.

Jared Farrish
  • 48,585
  • 17
  • 95
  • 104
emeralddove
  • 520
  • 1
  • 5
  • 11

2 Answers2

0

I suspect you will be unable to parse HTML using DataPower. DataPower can parse well-formed XML, but HTML - unless it is explicitly designed as xHTML - is likely to be full of tags that break well-formedness.

Many web pages are full of tags like <br> or <ul><li>Item1<li>Item2<li>Item3</ul>, all of which will cause the parsing to fail.

If you really want to follow your suggested approach, you'll probably need to do something on a more flexible platform such as WAS where you can build (or reuse) a parser that takes care of all of that for you.

If you think about it, this is what your web browser does - it has all the complex rules that turn badly-formed XML tags (i.e. HTML) into a valid DOM structure. It sounds like you may be better off doing manipulation at the level of the DOM rather than the HTML, as that way you can leverage existing, well-tested parsing solutions and focus on the structure of the data. You could do this client-side using JavaScript or you could look at a server-side JavaScript option such as Rhino or PhantomJS.

All of this might be doing things the hard way, though. Have you confirmed whether or not the underlying application has any APIs or web services that IT uses to render the pages, allowing you to get to the data without the existing presentation layer getting in the way?

Cheers, Chris

ChrisC
  • 2,393
  • 2
  • 18
  • 24
-1

Question of parsing and HTML page originates when you want to do some processing over it. If this is the case you can face problems because datapower by default will not allow hyperlinks inside the well formed XML or HTML document [It is considered to be a security risk], however this can be overcome with appropriate settings in XML manager present.

As far as question of HTML page parsing is concerned, Datapower being and ESB layer is expected to provide message format translation and that it indeed does. So design wise it is a good place to do message format translation. Practically however you will face above mentioned problem when you try to parse HTML as XML document.

The parsing can produce any message format model you wish [theoretically] hence you can use the XSLT to achieve what you wish.

  • Ajitabh