I am using an HTML purifier package to purify my rich text of any XSS before storing it in the database.

But my rich text allows Wiris symbols, which use special characters such as &#8594; (→) or &#160; (non-breaking space).

The problem is that the package does not allow me to escape these characters; it removes them completely. What should I do to keep them escaped?

Example of the string before purifying

<p><math xmlns="http://www.w3.org/1998/Math/MathML"><msup><mi>x</mi><mn>2</mn></msup><mo>&#160;</mo><mo>+</mo><mo>&#160;</mo><mmultiscripts><mi>y</mi><mprescripts/><none/><mn>2</mn></mmultiscripts><mo>&#160;</mo><mover><mo>&#8594;</mo><mo>=</mo></mover><mo>&#160;</mo><msup><mi>z</mi><mn>2</mn></msup><mo>&#160;</mo></math></p>

After purifying

<p><math xmlns="http://www.w3.org/1998/Math/MathML"><msup><mi>x</mi><mn>2</mn></msup><mo> </mo><mo>+</mo><mo> </mo><mmultiscripts><mi>y</mi><mprescripts></mprescripts><none><mn>2</mn></mmultiscripts><mo> </mo><mover><mo>→</mo><mo>=</mo></mover><mo> </mo><msup><mi>z</mi><mn>2</mn></msup><mo> </mo></math></p>
Omar Elewa

2 Answers

My guess is that these entities are failing the regexen that HTML Purifier is using to check for valid entities in HTMLPurifier_EntityParser, here:

         $this->_textEntitiesRegex =
             '/&(?:'.
             // hex
             '[#]x([a-fA-F0-9]+);?|'.
             // dec
             '[#]0*(\d+);?|'.
             // string (mandatory semicolon)
             // NB: order matters: match semicolon preferentially
             '([A-Za-z_:][A-Za-z0-9.\-_:]*);|'.
             // string (optional semicolon)
             "($semi_optional)".
             ')/';
 
         $this->_attrEntitiesRegex =
             '/&(?:'.
             // hex
             '[#]x([a-fA-F0-9]+);?|'.
             // dec
             '[#]0*(\d+);?|'.
             // string (mandatory semicolon)
             // NB: order matters: match semicolon preferentially
             '([A-Za-z_:][A-Za-z0-9.\-_:]*);|'.
             // string (optional semicolon)
             // don't match if trailing is equals or alphanumeric (URL
             // like)
             "($semi_optional)(?![=;A-Za-z0-9])".
             ')/';

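Notice the 0* in the // dec branch. My guess is that this is where your numeric entities get tripped up, although strictly 0* only permits leading zeros rather than requiring them, so treat this as a lead to check rather than a confirmed cause. (The parser's behaviour is perfectly sane, since it's designed to handle pure HTML, without add-ons, and to make that safe; but in your use case, you want more entity flexibility.)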

You could extend that class and override the constructor, which is where these regexen are defined, supplying your own versions with the 0* removed from the // dec part. Then instantiate your subclass and try setting $this->_entity_parser on a Lexer created with HTMLPurifier_Lexer::create($config) to that EntityParser object. This is the part I am least sure would work; you might have to patch the Lexer with an extends as well. Finally, supply the altered Lexer to the config using Core.LexerImpl.

I have no working proof-of-concept of these steps for you right now (especially in the context of Laravel), but you should be able to go through those motions in the purifier.php file, before the return.

pinkgothic
  • Thanks much for your answer, I found a simpler solution by just setting Core.EscapeNonASCIICharacters to true in my configurations file. – Omar Elewa Apr 16 '22 at 23:09
  • But I encountered another problem ^--^ https://stackoverflow.com/questions/71898104/htmlpurifier-how-to-escape-self-closing-tags If you are able to give me any help, I would be very thankful. – Omar Elewa Apr 16 '22 at 23:12

I solved the problem by setting the key Core.EscapeNonASCIICharacters to true under my default key in my purifier.php file, and the problem is gone.
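For reference, the change might look roughly like the sketch below. The surrounding layout (a `settings` array containing `default`) is an assumption based on the usual Laravel HTML Purifier wrapper config and may differ in your purifier.php; only the Core.EscapeNonASCIICharacters directive itself comes from HTML Purifier.

```php
<?php
// config/purifier.php (sketch; the 'settings' nesting is an assumed layout)

return [
    'settings' => [
        'default' => [
            // Re-encode every non-ASCII character in the purified output
            // as a decimal numeric entity, e.g. "→" becomes "&#8594;".
            'Core.EscapeNonASCIICharacters' => true,

            // ... keep your existing directives for this profile here ...
        ],
    ],
];
```

With this directive enabled, characters such as the arrow and the non-breaking space should come back out of the purifier as numeric entities instead of raw characters.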

Omar Elewa