I have a large number of HTML (and possibly other xml) documents that I need to redact.
The redactions are typically of the form "John Doe" -> "[Person A]". The text to be redacted may be in headers or paragraphs, but will almost always be in paragraphs.
Simple string substitutions really. Not very complicated things.
However, I do want to preserve document structure, and I would prefer to not reinvent any wheels. String substitution in the document text may do the job, but also may break document structure, so it will be a last option.
Right now I have stared at XSLT for an hour and tried to force "str:replace" to do my bidding. I will spare you from viewing me feeble attempts that didn't work, but I will ask this: Is there a simple and know way to apply my redactions using XSLT, and could you post it here?
Thank you in advance.
Update: at the request of Martin Honnen I'm adding my input files, as well as the command I used to get the latest error message. From this it will be apparent that I'm a complete n00b when it comes to XSLT :-)
.html file:
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN"> <html> <head> <meta http-equiv="content-type" content="text/html; charset=utf-8"/> <title>TodaysDate</title> <meta name="created" content="2020-11-04T30:45:00"/> </head> <body> <ol start="2"> <li><p> John Doe on 9. fux 2057 together with Henry Fluebottom formed the company Doe &; Fluebottom Widgets Inc. </p> </ol> </body> </html>
The XSLT transformation file:
<?xml version="1.0"?>
<xsl:stylesheet version="1.0"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
>
<xsl:template match="p">
<xsl:copy>
<xsl:attribute name="matchesPattern">
<xsl:copy-of select='str:replace("John Doe", ".*", "[Person A]")'/>
</xsl:attribute>
<xsl:copy-of select='str:replace("Henry Fluebottom", ".*", "[Person B]")'/>
</xsl:copy>
</xsl:template>
</xsl:stylesheet>
The command and the output:
$ xsltproc -html transform.xsl example.html
xmlXPathCompOpEval: function replace bound to undefined prefix str
xmlXPathCompiledEval: 2 objects left on the stack.
<?xml version="1.0"?>
TodaysDate
<p matchesPattern=""/>
$
John Doe is sick
`) but in mixed contents or spread across several elements like `John Doe is sick.
`). Thus, at least show us a representative, small input and output sample, even if you feel your coding attempt failed. But you could show one also and tell us how it failed exactly. – Martin Honnen Mar 23 '20 at 11:01