4

One of my REST APIs is expecting a property "url" which expects a URL as input from the user. I am using ESAPI to prevent from XSS attacks. The problem is that the user supplied URL is something like

http://example.com/alpha?abc=def&phil=key%3dbdj

The cannonicalize method from the ESAPI encoder throws intrusion exception here claiming that the input has mixed encoding, since it is url encoded and the piece '&phi' is treated as HTML encoded and thus the exception.

I had a similar problem with sanitizing one of my application urls where the second query parameter started with 'pa' or 'pi' and was converted to delta or pi characters by HTML decoding. Please refer to my previous Stackoverflow question here

Now since the problem is that since the entire URL is coming as input from the user, I cannot simply parse out the Query parameters and sanitize them individually, since malicious input can be created combining the two query parameters and sanitizing them individually wont work in that case.

Example: &ltscr comes is last part of first query param value and ipt&gtalert(0); or something comes as first part of the next query param control context.

Has anyone faced a similar problem? I would really like to know what solutions you guys implemented. Thanks for any pointers.

EDIT: The below answer from 'avgvstvs' does not throw the intrusion exception (Thanks!!). However, the cannonicalize method now changes the original input string. ESAPI treats &phi of the query param to be some html encoded char and replaces it to '?' char. Something like my previous question which is linked here. The difference being that was a URL of my application whereas this is user input. Is my only option maintaining a white list here?

Community
  • 1
  • 1
Nishant Nagwani
  • 1,160
  • 3
  • 13
  • 26
  • Possible duplicate of http://stackoverflow.com/questions/23267759/esapi-canonicalize-malforming-url?rq=1 – avgvstvs May 03 '14 at 18:30
  • I updated my answer to include the amendment to your question. It is mostly complete, but as noted I may have uncovered a small bug in ESAPI. – avgvstvs May 07 '14 at 12:55
  • Just for reference: here is a bug open with OWASP regarding the same issue. http://code.google.com/p/owasp-esapi-java/issues/detail?id=258 – Nishant Nagwani May 07 '14 at 18:47
  • After a few months of discussions with esapi developers, the problem is that a URI is a language in its own right. They suggested doing what I discussed here, using the standard Java classes to break up the URL, and then canonicalizing/validating the smaller parts of the URL in formats that aren't encoded into a the language of URLs. Canonicalize is mainly intended to be used for detection of double-encoding, and your query parameter `&phil=key%3dbdj` is *requiring* a doubly-encoded scheme. You need to do something custom here. – avgvstvs Feb 12 '15 at 18:23

1 Answers1

2

The problem that you're facing here, is that there are different rules for encoding different parts of a URL--to memory there's 4 sections in a URL that have different encoding rules. First, understand why in Java, you need to build URLs using the UriBuilder class. The URL specification will help with nitty-gritty details.

Now since the problem is that since the entire URL is coming as input from the user, I cannot simply parse out the Query parameters and sanitize them individually, since malicious input can be created combining the two query parameters and sanitizing them individually wont work in that case.

The only real option here is java.net.URI.

Try this:

URI dirtyURI = new URI("http://example.com/alpha?abc=def&phil=key%3dbdj");

String cleanURIStr = enc.canonicalize( dirtyURI.getPath() );

The call to URI.getPath() should give you a non-percent encoded URL, and if enc.canonicalize() detects double-encoding after that stage then you really DO have a double-encoded string and should inform the caller that you will only accept single-encoded URL strings. The URI.getPath() is smart enough to use decoding rules for each part of the URL string.

If its still giving you some trouble, the API reference has other methods that will extract other parts of the URL, in the event that you need to do different things with different parts of the URL. IF you ever need to manually parse parameters on a GET request for example, you can actually just have it return the query string itself--and it will have done a decoding pass on it.

=============JUNIT Test Case============

package org.owasp.esapi;

import java.net.URI;
import java.net.URISyntaxException;

import org.junit.Test;

public class TestURLValidation {

    @Test
    public void test() throws URISyntaxException {
        Encoder enc = ESAPI.encoder();
        String input = "http://example.com/alpha?abc=def&phil=key%3dbdj";
        URI dirtyURI = new URI(input);
        enc.canonicalize(dirtyURI.getQuery());
        
    }

}

=================Answer for updated question=====================

There's no way around it: Encoder.canonicalize() is intended to reduce escaped character sequences into their reduced, native-to-Java form. URLs are most likely considered a special case so they were most likely deliberately excluded from consideration. Here's the way I would handle your case--without a whitelist, and it will guarantee that you are protected by Encoder.canonicalize().

Use the code above to get a URI representation of your input.

Step 1: Canonicalize all of the URI parts except URI.getQuery() Step 2: Use a library parser to parse the query string into a data structure. I would use httpclient-4.3.3.jar and httpcore-4.3.3.jar from commons. You'll then do something like this:

import java.net.URI;
import java.net.URISyntaxException;
import java.util.Iterator;
import java.util.List;

import javax.ws.rs.core.UriBuilder;

import org.apache.http.client.utils.URLEncodedUtils;
import org.junit.Test;
import org.owasp.esapi.ESAPI;
import org.owasp.esapi.Encoder;

public class TestURLValidation
{

  @Test
  public void test() throws URISyntaxException {
    Encoder enc = ESAPI.encoder();
    String input = "http://example.com/alpha?abc=def&phil=key%3dbdj";
    URI dirtyURI = new URI(input);
    UriBuilder uriData = UriBuilder.fromUri(enc.canonicalize(dirtyURI.getScheme()));
    uriData.path(enc.canonicalize(enc.canonicalize(dirtyURI.getAuthority() + dirtyURI.getPath())));
    println(uriData.build().toString());
    List<org.apache.http.NameValuePair> params = URLEncodedUtils.parse(dirtyURI, "UTF-8");
    Iterator<org.apache.http.NameValuePair> it = params.iterator();
    while(it.hasNext()) {
      org.apache.http.NameValuePair nValuePair = it.next();
      uriData.queryParam(enc.canonicalize(nValuePair.getName()), enc.canonicalize(nValuePair.getValue()));
    }
    String canonicalizedUrl = uriData.build().toString();
    println(canonicalizedUrl);
  }

  public static void println(String s) {
    System.out.println(s);
  }
  
}

What we're really doing here is using standard libraries to parse the inputURL (thus taking all the burden off of us) and then canonicalizing the parts after we've parsed each section.

Please note that the code I've listed won't work for all url types... there are more parts to a URL than scheme/authority/path/queries. (Missing is the possibility of userInfo or port, if you need those, modify this code accordingly.)

Community
  • 1
  • 1
avgvstvs
  • 6,196
  • 6
  • 43
  • 74
  • Thanks for your response. However, this doesn't solve the problem since when cannonicalizing the query string part, the encoder cannonicalizes the second query string param '&phi' to the char '?' because it probably thinks its an HTML encoding. Is some kind of whitelist for false positives is my only option here? – Nishant Nagwani May 05 '14 at 15:49
  • I added a JUnit test case. `encoder.canonicalize()` throws no exceptions based upon the input string you provided in your original question when using the URI class to parse the input string. – avgvstvs May 05 '14 at 19:12
  • I agree that it does not throw the exception, and your suggestion probably works for my original ( and badly worded :( ) question. However, since its changing the input string, this solution is not helping me much. I just wanted to know your suggestion about my next question, whether creating a whitelist is my only option? Thanks for your help again. – Nishant Nagwani May 05 '14 at 19:20
  • That depends, how much risk is there that the URL will be displayed in a browser, or used as input to a database? – avgvstvs May 05 '14 at 19:45
  • Lets assume both. the url would probably be saved in the database and would be displayed on a GET request from a browser. Since its modifying the original input, I cannot store it in the database after that. – Nishant Nagwani May 05 '14 at 20:05
  • @Nishant Nagwani, It's starting to sound like you have a requirement that is *forcing* you to take doubly-encoded input. `phil=key%3dbdj` gets converted to `abc=def&phil=key=bdj` before we pass to `canonicalize`. I don't see why you have to worry about this case--the `=` character is in the legal ASCII range, and we *don't* have to escape ASCII characters in order to be compliant with the URL specifications. – avgvstvs May 05 '14 at 20:37
  • Actually, I do not completely agree with your last point. The character '=' is fine. The problem is the second query parameter. The HTML encoding for the character phi is φ However due to previous IE versions (I am not 100% certain), HTML codec, which is registered to the default ESAPI encoder, treats &phi, φ, and the greek letter phi as the same, and thinks that the input URL contains malicious HTML ('&phi'). If my query param was not &phi and was &alpha=1234, the ESAPI encoder won't take any chances and would modify the input assuming there's HTML in there. Hope this makes sense. – Nishant Nagwani May 05 '14 at 20:52
  • Now that I took some time to play with this in JUnit, I see exactly what the issue is. I don't think canonicalize was designed with URLs as input at all. One workaround maybe would be to split the query string up on `&`, canonicalize the contents, and then reconstruct the query String. I have to study for a final but hopefully that will get you somewhere in the meantime. – avgvstvs May 05 '14 at 21:49
  • Here's a fast workaround: `String clean = enc.canonicalize(query.replaceAll("&", new String(Character.toChars(0x1410))));` – avgvstvs May 05 '14 at 21:57
  • After canonicalization you replace the `0x1410` character with "&" again and then do what you need to do with it. – avgvstvs May 05 '14 at 21:58
  • I did consider splitting the URL, I handle it that way when its my application URL. However, here it seemed like a bad idea because even the query param (including its value) is provided by the user, which is untrusted input. I tried to give an example to this problem, in the question above. I think its a flaw in ESAPIs design that it does not handle this properly. This could easily be an input from the user for some other field and not just an url. Thanks again for your help and good luck with your Final. – Nishant Nagwani May 05 '14 at 22:01
  • Hi, I saw the update on your answer. However I have one question. I noticed that you are basically splitting the URL, sanitizing the parts individually, and then combining the parts to form a clean URL. I am not sure how would this code handle a handsomely crafted malicious input, where a part of the malicious input is in the value part of first NV pair, and the second part of malicious string is in the Name part of second NV pair. The sanitization would kind of fail because it wont be treated as malicious by itself. Does my question make sense? – Nishant Nagwani May 07 '14 at 18:44
  • Also, here is a bug open with OWASP regarding the same issue. I dont think though that they would fix this in the near future. https://code.google.com/p/owasp-esapi-java/issues/detail?id=258 – Nishant Nagwani May 07 '14 at 18:45
  • In *theory* I'm splitting it, but I'm relying upon professional-grade parsers to do it. Both `UriBuilder` and the `URI` class are built to parse/build URLs according to RFC 2396 and RFC 2732. – avgvstvs May 08 '14 at 11:48
  • And for the record, `canonicalize` does resolve the `%3` correctly to `=`, so there's no bug involved there. The `UriBuilder` puts it back. – avgvstvs May 08 '14 at 11:48
  • And I recently became a contributor to the ESAPI project, I'll make that bug the next one I tackle. – avgvstvs May 08 '14 at 11:51
  • Commits implementing a fix of this kind were just approved by OWASP: https://github.com/ESAPI/esapi-java-legacy/commit/aafa3539002e01cc40b7d12868ab043838e62563 – avgvstvs Jun 28 '16 at 16:47