Canonicalize() function converts chars to white space

Question

I am using EncodeForHTML() to prevent Cross Site Scripting (XSS) attacks. As a result, a text field entered as:

step 1:   cost too much to keep. #3&#4 bad business decision

is stored in the database as:

step 2:   cost too much to keep. &#xd;&#xa;&#x23;3&amp;&#x23;4 bad business decision

Then I use canonicalize() to get back the original string:

 #canonicalize(fieldName, false, false ,true)#

which should return what was entered in step 1 above.

However, the &#4 is displayed as a whitespace character that looks almost like a square. This happens for any &# followed by a single digit.

This is ColdFusion 2018. Any ideas on how to get back the original #3&#4?

What are you using for the `canonicalize` attribute when you use the `EncodeForHTML` function? — Dan Bracuk, Aug 14 '19 at 18:13
@DanBracuk The default value, which is false, so the code looks like `encodeForHTML(string)`. I tried passing true, but that did not help. — CFNinja, Aug 14 '19 at 19:18
Why are you using `canonicalize()`? To reverse `encodeForHtml()`, use [`decodeForHtml()`](https://helpx.adobe.com/coldfusion/cfml-reference/coldfusion-functions/functions-c-d/DecodeForHTML.html). — Alex, Aug 14 '19 at 19:22
It seems to work for me. Can you post a snippet which demonstrates the issue? — Alex, Aug 14 '19 at 19:59
I quickly made a cffiddle: https://cffiddle.org/app/file?filepath=8fa70477-7cca-45fc-9b5d-3b9e810de19f/81bef57a-34c9-479d-9168-daabc0c33ad9/b6427314-bea3-4ca1-a65a-2c863bbf6375.cfm — Bernhard Döbler, Aug 14 '19 at 20:15
CF uses this library: https://github.com/ESAPI/esapi-java-legacy Maybe you can raise an issue there. — Bernhard Döbler, Aug 14 '19 at 20:34
Canonicalizing `&#4` is guessing `&#4;`, so decoding this to EOT (control character, not printable) is working as intended. What do you even intend by `I use canonicalize to get back the original string`? You have the "original string" before you encode it. There's no need to decode it back, and even then you need to decode the encoded value, not the unencoded one. — Alex, Aug 14 '19 at 20:41
@BernhardDöbler You are putting ## to make it work within cfset. The string is stored in the database. It is not the same input. — CFNinja, Aug 14 '19 at 21:11
As you say, I put ## to make it work in cfset. `canonicalize` still only sees one `#` — Bernhard Döbler, Aug 14 '19 at 21:14
@BernhardDöbler, the use case is not the same. Please see the updated question. — CFNinja, Aug 14 '19 at 21:16
@Alex this worked: `decodeForHTML(encodeForHtml(myText))`. If you want to post it as an answer, I will mark it as accepted. Thanks — CFNinja, Aug 14 '19 at 21:45
@CFNinja I posted a comprehensive answer for you, making sure you never mix up encoding/decoding for HTML ever again. ;) — Alex, Aug 14 '19 at 22:20

Alex · Accepted Answer · 2019-08-14T22:24:12.220

Okay, let's go through this:

Encoding `#3&#4` for HTML

`#` becomes `&#x23;` (hex entity)
`3` becomes `3` (no encoding required)
`&` becomes `&amp;` (named entity)
`#` becomes `&#x23;` (hex entity)
`4` becomes `4` (no encoding required)

Note: `&#xd;&#xa;` in your example is a carriage return and a line feed, so basically there is a newline in front of `#3&#4`. We will ignore this for now.

Decoding `&#x23;3&amp;&#x23;4` for HTML

Regardless of whether you use decodeForHtml() or canonicalize():

`&#x23;` becomes `#`
`3` becomes `3`
`&amp;` becomes `&`
`&#x23;` becomes `#`
`4` becomes `4`

This is absolutely correct and there's no issue here. So...

Why am I seeing □?

It's simple: You are outputting the decoded value in HTML.

If you tell your browser to render `#3&#4` as HTML, it will "smart-detect" an incomplete entity. Entities always start with `&`, which is why you are supposed to encode an actual ampersand as `&amp;`, so the browser recognizes it as a literal character. Nowadays most browsers automatically detect a single/standalone `&` and treat it as a literal ampersand. However, in your case, the browser assumes you meant to say `&#4;` (short for `&#04;` or `&#x04;`), which is the control character EOT and cannot be printed, resulting in a □.
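One way to check that the decoded string itself is intact, and that only the rendering is at fault, is to print each character next to its code point. This is a minimal sketch; the variable names are made up for illustration, and the encoded literal is the one from the question.

```cfml
<cfscript>
// Encoded value as stored in the database.
// A literal # inside a CFML string has to be doubled.
stored  = "&##x23;3&amp;&##x23;4";
decoded = decodeForHTML(stored);

// Print every character together with its code point. Each character is
// encoded on output, so the browser cannot reinterpret "&" followed by "#".
for (i = 1; i <= len(decoded); i++) {
    ch = mid(decoded, i, 1);
    writeOutput(encodeForHTML(ch) & " = " & asc(ch) & "<br>");
}
</cfscript>
```

If the last line shows the digit `4` with code point 52, the string still holds the original characters and the square only appears at render time. A code point of 4 would mean the string itself already contains EOT.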

The Solution

Whenever you want to display something in HTML, you have to encode the values. If you need to inspect a variable in ColdFusion, prefer <cfdump var="#value#"> (or writeDump(value)) over just outputting a value via <cfoutput>#value#</cfoutput> (or writeOutput(value)).
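Applied to the field from the question, the rule could look like the sketch below. `storedValue` is a hypothetical stand-in for the database column; the assumption is that the value was saved with encodeForHTML() and has to be decoded exactly once before further use.

```cfml
<cfscript>
// Hypothetical variable standing in for the database field from the question.
storedValue = "cost too much to keep. &##x23;3&amp;&##x23;4 bad business decision";

// Decode once to get the raw text for processing or comparison.
rawValue = decodeForHTML(storedValue);

// Encode again at the moment the value is written into HTML.
writeOutput(encodeForHTML(rawValue));

// For debugging, dump the value instead of printing it unencoded.
writeDump(rawValue);
</cfscript>
```

Storing the raw text and encoding only at output time would avoid the decode step altogether, but that is a design choice outside the scope of the question.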

Demo

https://cffiddle.org/app/file?filepath=6926a59a-f639-4100-b802-07a17ff79c53/5d545e2c-01a4-4c13-9f50-eb15777fba8c/6307a84e-89a3-411d-874f-7d32bd9a9874.cfm

<cfset charsToEncode = [
    "##", <!--- we have to escape # in ColdFusion by doubling it --->
    "3",
    "&",
    "##", <!--- we have to escape # in ColdFusion by doubling it --->
    "4"
]>

<h2>encodeForHtml</h2>
<cfloop array="#charsToEncode#" index="char">
    <cfdump var="#encodeForHtml(char)#"><br>
</cfloop>

<cfset charsToDecode = [
    "&##x23;", <!--- we have to escape # in ColdFusion by doubling it --->
    "3",
    "&amp;",
    "&##x23;", <!--- we have to escape # in ColdFusion by doubling it --->
    "4"
]>

<h2>decodeForHtml</h2>
<cfloop array="#charsToDecode#" index="char">
    <cfdump var="#decodeForHtml(char)#"><br>
</cfloop>

<h2>canonicalize</h2>
<cfloop array="#charsToDecode#" index="char">
    <cfdump var="#canonicalize(char, false, false)#"><br>
</cfloop>

<h2>encoding the output PROPERLY</h2>
<cfoutput>#encodeForHtml("##3&##4")#</cfoutput><br>
<cfoutput>#encodeForHtml(decodeForHtml("&##x23;3&amp;&##x23;4"))#</cfoutput><br>
Note: due to the mix of entities, canonicalize() has to guess the begin/end of each entity and is having issues with the ampersand here:<br>
<cfoutput>#encodeForHtml(canonicalize("&##x23;3&##x26;&##x23;4", false, false))#</cfoutput><br>

<h2>encoding the output INCORRECTLY</h2>
#3&#4<br>
<cfoutput>#decodeForHtml("&##x23;3&amp;&##x23;4")#</cfoutput><br>
<cfoutput>#canonicalize("&##x23;3&amp;&##x23;4", false, false)#</cfoutput><br>

Slow applause . . . — Adrian J. Moreno, Aug 15 '19 at 15:21
