8

I'd like to use PHP Tidy to ensure my XML is valid before I load it into a DOMDocument.

However, I don't want Tidy to change anything about my formatting. I only want it to repair problems such as unbalanced tags.

An example of the problem can be seen at this page: http://www.tek-tips.com/viewthread.cfm?qid=1654452

My own example is the following.

Input: <ex><context>собр<stress>а</stress>ние</context> акцион<stress>е</stress>ров — <stress>aa</stress>ndeelhoudersvergadering</ex> (which is already valid XML)

Expected output: <ex><context>собр<stress>а</stress>ние</context> акцион<stress>е</stress>ров — <stress>aa</stress>ndeelhoudersvergadering</ex> (there is a breaking space between </context> and акцион)

Actual output:

<ex>
<context>собр
<stress>а</stress>ние</context>акцион
<stress>е</stress>ров — 
<stress>aa</stress>ndeelhoudersvergadering</ex>

(it removed the space between </context> and акцион, which makes the text unreadable, and it inserted newlines after each tag)

My code is:

function TidyXml($inputXml)
{
    $config = array(
        'indent'     => false,
        'output-xml' => true,
        'input-xml'  => true,
    );

    $tidy = new tidy();
    $tidy->parseString($inputXml, $config, 'utf8');
    $tidy->cleanRepair();

    return tidy_get_output($tidy);
}

I tried changing several options, but without success.

hakre
hansmbakker
  • http://tidy.sourceforge.net/docs/quickref.html#output-xml – hakre Mar 01 '13 at 08:53
  • PHP Simple HTML DOM Parser is a much more lenient parser than most. http://simplehtmldom.sourceforge.net/ – Petah Mar 01 '13 at 08:54
  • @hakre I removed all settings except for `'input-xml' => true` (needed because otherwise it will output a complete HTML document). However, it didn't help. Also I tried setting `'output-xml' => false`, but this didn't help. Can anything be done to prevent stripping / trimming and formatting? – hansmbakker Mar 01 '13 at 19:57
  • I found http://stackoverflow.com/questions/4048234/no-linebreak-after-tags-in-tidy - but it seems strange to me that it's impossible to switch off formatting (newlines and trimming) – hansmbakker Mar 01 '13 at 20:01
  • @Petah it seems to be more html-oriented. I tried it, but it does not fix the broken xml I fed it. For example `geog.` should be fixed to `geog.` so that the tags are balanced. – hansmbakker Mar 01 '13 at 20:20

2 Answers

5

I found a solution, but it is a bit hackish, so I'm still open to better suggestions.

Put <pre> around the XML you want to validate (this instructs Tidy not to change the whitespace), then repair the XML with output-html set to true, then remove the <pre> wrapper and the \n newlines.

Example:

$config = array(
    'indent' => false,
    'indent-attributes' => false,
    'output-html' => true,
    'input-xml' => true,
    'wrap' => 0,  
    'vertical-space' => false,  
    'new-inline-tags' => 'context,abr,stress',  
    'new-blocklevel-tags'   => 'def,ex,examples'
);

$tidy = new tidy();
$inputXml = "<pre>" . $inputXml . "</pre>";
$validXml = $tidy->repairString($inputXml, $config, 'utf8');
// Tidy adds newlines of its own; strip them, then drop the <pre> wrapper.
$cleanXml = str_replace("\n", "", $validXml);
$cleanXml = substr($cleanXml, strlen("<pre>"), -strlen("</pre>"));
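
The unwrapping step above assumes Tidy's output starts with exactly `<pre>` and ends with exactly `</pre>`. As a minimal sketch, that step could be pulled into a helper that checks for the wrapper before cutting it off. The name `stripPreWrapper` is made up for illustration, and the helper only does string handling; it does not call Tidy itself:

```php
<?php
// Hypothetical helper (not part of Tidy): undo the <pre> wrapping after repair.
function stripPreWrapper(string $tidied): string
{
    $open  = '<pre>';
    $close = '</pre>';

    // Drop the line breaks Tidy inserted (both \r and \n, in case of CRLF output).
    $flat = str_replace(array("\r", "\n"), '', $tidied);

    // Only cut the wrapper off when it is really there at both ends.
    $hasWrapper = strlen($flat) >= strlen($open) + strlen($close)
        && substr($flat, 0, strlen($open)) === $open
        && substr($flat, -strlen($close)) === $close;

    if (!$hasWrapper) {
        return $flat;
    }

    // The cast keeps the return type a string even when nothing is left inside.
    return (string) substr($flat, strlen($open), -strlen($close));
}
```

Like the original `str_replace`, this also removes any newlines that were part of the XML itself, so it only suits input where line breaks carry no meaning.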
hansmbakker
1

In my case, I was able to run a replace on the HTML to remove the multiple empty lines and prevent Tidy from adding the breaks:

$html = preg_replace("/\n([\s]*)\n/", "\r\n", $html);
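
To illustrate what that replacement does, here is a small self-contained sketch on a made-up HTML fragment. The function name `collapseBlankLines` and the sample string are assumptions for the example; only the `preg_replace` pattern comes from the answer above:

```php
<?php
// Hypothetical wrapper around the pattern from the answer.
function collapseBlankLines(string $html): string
{
    // A newline, optional whitespace (including further newlines), then a
    // newline: the whole run is replaced by a single CRLF line break.
    return preg_replace("/\n([\s]*)\n/", "\r\n", $html);
}

$sample = "<p>one</p>\n   \n\n<p>two</p>\n<p>three</p>";
echo collapseBlankLines($sample);
```

Single line breaks are left alone; only runs that contain an empty or whitespace-only line are collapsed.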