I am exporting a set of records to XML and then to XLIFF through an XSLT transformation. The export works, but some characters come out wrong in the export file. Here are the step-by-step details:

Step 1. The user enters a string with mixed characters, e.g. the following: Autocomplete On' see the wrong character ==> í

The MySQL database/table field encoding is set to utf8, e.g.

  `unicode longtext COLLATE utf8_unicode_ci`

which stores the above text.

Step 2. An HTML snippet is generated for export, e.g.

    <html version="1.2">
        <table>
            <tr>
                <td id="Autocomplete_On">Autocomplete On' see the wrong character ==&#62; í</td>
            </tr>
        </table>
    </html>

Step 3. Convert to xml

  <?xml version="1.0" standalone="yes"?>
     <html version="1.2"><body><table><tr><td id="Autocomplete_On">
        Autocomplete On' see the wrong character ==&gt; &#xC3;&#xAD;</td>
</tr></table></body></html>

Step 4. Transform using XSLT:

(Only the relevant portion of the output is pasted. This is what I see when viewing it in a browser, while the actual character in the file is à.)

 <body>
      <group id="id796986axmarkhtml-0">
        <group id="id533787bxmarkbody-1">
          <group id="id533788bxmarktable-2">
            <group id="id533790bxmarktr-3">
              <trans-unit id="td-4">
                <source>Autocomplete On' see the wrong character ==&gt; í</source>
                <target>Autocomplete On' see the wrong character ==&gt; í</target>
              </trans-unit>
            </group>
          </group>
        </group>
      </group>
    </body>

Actual code:

    private function xml2xliff($htmlStr, $source, $target) {
        $xml = new \DOMDocument();
        // hacky way to tidy html
        @$xml->loadHTML($htmlStr); // step 3
        $xsl = new \DOMDocument;
        $xsl->load(__DIR__.'/xliff/xsl/xml2xliff.xsl');
        $proc = new \XSLTProcessor();
        $proc->importStyleSheet($xsl);
        $proc->setParameter('', 'source', $this->getIsoName($source));
        $proc->setParameter('', 'target', $this->getIsoName($target));
        return $proc->transformToXML($xml); // step 4
    }

`$htmlStr` is the HTML snippet generated in step 2.

So the issue is that the string is transformed twice. The character in question goes through these stages:

step 1. í

step 2. still í

step 3. converted to Ã plus a soft hyphen, i.e. &#xC3;&#xAD;

step 4. written to the output as those two literal characters
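
The jump from í to &#xC3;&#xAD; is consistent with the UTF-8 bytes of the character being read as if they were ISO-8859-1 and then re-encoded. Here is a minimal sketch of that effect, assuming the mbstring extension is available. The variable names are only illustrative and are not part of the export code.

```php
<?php
// "í" (U+00ED) written with an escape so the example does not depend
// on the encoding of this source file.
$utf8 = "\u{ED}";

// In UTF-8 the character is stored as two bytes: C3 AD.
echo bin2hex($utf8), "\n";

// Treat each of those bytes as an ISO-8859-1 character and re-encode as UTF-8.
// This mimics a parser that assumes Latin-1 for input that is really UTF-8.
$misread = mb_convert_encoding($utf8, 'UTF-8', 'ISO-8859-1');

// The result is U+00C3 (Ã) followed by U+00AD (soft hyphen),
// i.e. the &#xC3;&#xAD; seen in step 3.
echo bin2hex($misread), "\n";
```

The two byte values C3 and AD are exactly the two code points that show up in the XML, which suggests the damage happens while the HTML is parsed, not in the XSLT step.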

Another example:

input. Autocomplete On They’re gone now

xml output. Autocomplete On Theyâre gone now


1 Answer

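DOMDocument::loadHTML() assumes ISO-8859-1 (Latin-1) when the markup does not declare a charset, but your string is UTF-8. Each byte of a multi-byte character is therefore read as a separate character, so the special character is split and destroyed. You can trick the parser into using UTF-8 by prepending an XML processing instruction: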

$html = <<<HTML
<html>
  <table>
    <tr>
      <td id="Autocomplete_On">Autocomplete On' see the wrong character ==&#62; í</td>
    </tr>
  </table>
</html>
HTML;

$dom = new DOMDocument('1.0', 'UTF-8');

$dom->loadHTML('<?xml encoding="UTF-8"?>'.$html);
var_dump(
  $dom->saveXml()
);

Output:

string(331) "<?xml version="1.0" standalone="yes"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<?xml encoding="UTF-8"??>
<html version="1.2"><body><table><tr><td id="Autocomplete_On">Autocomplete On' see the wrong character ==&gt; &#xED;</td>&#xD;
    </tr></table></body></html>
"
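
If the same trick is needed in several places, it could be wrapped in a small helper. The sketch below assumes the dom extension is loaded. `loadUtf8Html()` is a hypothetical name, not something from the question's code, and it uses `libxml_use_internal_errors()` in place of the `@` operator to silence parser warnings.

```php
<?php
// Hypothetical helper: parse an HTML fragment that is known to be UTF-8.
function loadUtf8Html(string $html): DOMDocument
{
    $dom = new DOMDocument('1.0', 'UTF-8');

    // Collect parser warnings internally instead of emitting them.
    $previous = libxml_use_internal_errors(true);

    // The processing instruction tells the HTML parser to read the input as UTF-8.
    $dom->loadHTML('<?xml encoding="UTF-8"?>' . $html);

    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    return $dom;
}

// Both problem characters from the question: ’ (U+2019) and í (U+00ED).
$expected = "They\u{2019}re gone now \u{ED}";
$html     = '<html><table><tr><td id="Autocomplete_On">' . $expected . '</td></tr></table></html>';

$dom  = loadUtf8Html($html);
$text = $dom->getElementsByTagName('td')->item(0)->textContent;

// Compare the parsed text with the input text.
var_dump($text === $expected);
```

In the question's `xml2xliff()` the equivalent change would be to pass `'<?xml encoding="UTF-8"?>' . $htmlStr` to `loadHTML()` before handing the document to the XSLT processor.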