We're cleaning up some errors on our site after migrating from Ruby 1.8.7 to 1.9.3 (Rails 3.2.12). We have one encoding error left: Bing is sending requests for URLs in the form

/search?q=author:\"Andr\xc3\xa1s%20Guttman\"

(This reads /search?q=author:"András Guttman", where the á is escaped).

In fairness to Bing, we were the ones that gave them those bogus URLs, but Ruby 1.9.3 isn't happy with them any more.

Our server is currently returning a 500. Rails is returning the error "Encoding::CompatibilityError: incompatible character encodings: UTF-8 and ASCII-8BIT"

I am unable to reproduce this error in a browser, or via curl or wget from OS X or Linux command line.

I want to send a 301 redirect back with a properly encoded URL.

I am guessing that I want to:

  1. detect whether the URL contains old-style UTF-8 and, only if it is malformed,
  2. use String#encode to get from old to new UTF-8
  3. use CGI.escape() to %-encode the URL
  4. 301 redirect to the corrected URL

So I have read a lot and am not sure how (or if) I can detect this bogus URL. I need to detect because otherwise I would have to 301 everything!

When I try in irb I get these results:

1.9.3p392 :015 > foo = "/search?q=author:\"Andr\xc3\xa1s%20Guttman\""
=> "/search?q=author:\"András%20Guttman\""
1.9.3p392 :016 > "/search?q=author:\"Andr\xc3\xa1s%20Guttman\"".encoding
=> #<Encoding:UTF-8>
1.9.3p392 :017 > foo.encoding
=> #<Encoding:UTF-8>

I have read this SO post but I am not sure if I have to go this far or even if this applies.

[Update: since posting, we have added a call to the code in the SO post linked above prior to all requests.]

So the question is: how can I detect the old-style encoding so that I can do the other steps?

Tom Harrison
  • Can you specify what version of Rails you are using? – tilthouse Jun 25 '13 at 17:16
  • updated post (3.2.12) – Tom Harrison Jun 25 '13 at 17:56
  • "I am unable to reproduce this error in a browser" If you can't reproduce it then there must be something funny happening to stop you being able to reproduce it. Please can you post a complete request path from Bing if you had snipped down the example to what you thought was the relevant bit. – Danack Jun 28 '13 at 02:21
  • As far as we can tell from logs, exceptions, and so on the URL request path and query string coming in to our domain appears as `/search?q=author:"András%20Guttman"`. I have tried to isolate out various transformations browsers (and even terminal windows) make by sending this request via `curl` and `wget`, but still cannot reproduce. – Tom Harrison Jul 03 '13 at 02:52
  • I was having the same problem with Bing causing 500's on our search urls. I was [eventually](http://stackoverflow.com/q/20153441/305019) able to reproduce using this curl command: `curl 'http://rails.host.com:3000/?x=✓'` - it seems to raise this error on *any* url. – gingerlime Nov 22 '13 at 22:35

2 Answers


First, let's look at the string manipulation side of things. It looks to me like using the URI module to unescape and then re-escape will just work:

2.0.0p0 :007 > foo = "/search?q=author:\"Andr\xc3\xa1s%20Guttman\""
=> "/search?q=author:\"András%20Guttman\""
2.0.0p0 :008 > URI.unescape foo
=> "/search?q=author:\"András Guttman\""
2.0.0p0 :009 > URI.escape URI.unescape foo
=> "/search?q=author:%22Andr%C3%A1s%20Guttman%22"
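
A note for newer Rubies: URI.escape and URI.unescape were removed in Ruby 3.0, so the session above only runs on older versions. Below is a minimal sketch of the same round trip without them. The helper names (unescape_bytes, escape_bytes) and the UNSAFE pattern are hypothetical, made up for this illustration, and the pattern only approximates the set of characters URI.escape left alone.

```ruby
# Hypothetical helpers for illustration; they are not part of the URI module.

# Bytes matching this pattern get %-encoded. The excluded characters
# approximate what URI.escape left untouched (unreserved plus reserved).
UNSAFE = /[^\-_.!~*'()a-zA-Z\d;\/?:@&=+$,\[\]]/n

# Decode every %XX sequence into its raw byte.
def unescape_bytes(str)
  str.dup.force_encoding(Encoding::BINARY)
     .gsub(/%([0-9A-Fa-f]{2})/n) { [$1.hex].pack('C') }
end

# %-encode every unsafe byte; the result contains only ASCII characters.
def escape_bytes(str)
  str.dup.force_encoding(Encoding::BINARY)
     .gsub(UNSAFE) { |byte| format('%%%02X', byte.ord) }
     .force_encoding(Encoding::UTF_8)
end

foo = "/search?q=author:\"Andr\xc3\xa1s%20Guttman\""
puts escape_bytes(unescape_bytes(foo))
# /search?q=author:%22Andr%C3%A1s%20Guttman%22
```

One caveat with any unescape-then-escape round trip: an escaped reserved character such as %26 comes back as a bare &, which can change how the query string is parsed.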

So the next question is where to do that? I'd say the problem with trying to detect strings with the \x escape character is that you can't GUARANTEE those strings were not supposed to be a literal backslash-x rather than an escaped byte (although, in practice, maybe that is an okay assumption).
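
One way to sidestep the backslash question, sketched here under an assumption: if the \xc3\xa1 in the log is just Ruby's printed form of two raw bytes, then the request contains no literal backslash at all, only bytes above 0x7F. A correctly %-encoded URL is pure ASCII, so checking for non-ASCII bytes would separate the two cases. The method name raw_non_ascii? is hypothetical.

```ruby
# Hypothetical helper: true when the string carries raw (unescaped)
# non-ASCII bytes, whatever encoding label the string happens to have.
def raw_non_ascii?(query_string)
  !query_string.ascii_only?
end

# Simulate a query string arriving as raw bytes (binary-labelled).
from_bing = "q=author:\"Andr\xC3\xA1s%20Guttman\"".dup.force_encoding(Encoding::BINARY)
escaped   = 'q=author:%22Andr%C3%A1s%20Guttman%22'

puts raw_non_ascii?(from_bing) # true
puts raw_non_ascii?(escaped)   # false
```

A literal backslash followed by x is itself ASCII, so such a string would not trigger this check.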

You might consider just adding a small Rack middleware that does this. See this Railscast for more on Rack. Assuming you only get these in the parameters (i.e., after the ? in the URL), then your middleware would look something like this (untested, just for illustration; place in your /lib folder as reescape_parameters.rb):

require 'uri' # possibly not needed?

class ReescapeParameters
  def initialize(app)
    @app = app
  end

  def call(env)
    env['QUERY_STRING'] = URI.escape URI.unescape env['QUERY_STRING']
    status, headers, body = @app.call(env)
    [status, headers, body]
  end
end

Then you use the middleware by adding a line to your application config or an initializer. For example, in /config/application.rb (or, alternatively, in an initializer):

config.middleware.use "ReescapeParameters"

Note that you will probably need to catch these parameters before any parameter handling by Rails. I'm not sure where in the Rack stack you'll need to put it, but you will more likely need:

config.middleware.insert_before ActionDispatch::ParamsParser, ReescapeParameters

Which would put it in the stack before ActionDispatch::ParamsParser. This is just a guess; you'll need to figure out the correct middleware to put it before. (FYI: There is an insert_after as well.)

UPDATE (REVISED)

If you MUST detect these and then send a 301, you could try:

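  def call(env)
    # Encoding#name is 'ASCII-8BIT'; comparing against the constant avoids the spelling question
    if env['QUERY_STRING'].encoding == Encoding::ASCII_8BIT
      query = URI.escape URI.unescape env['QUERY_STRING']
      # Location needs the path as well as the query string
      location = "#{env['PATH_INFO']}?#{query}"
      # A Rack body must respond to #each, so use an array rather than a string
      [301, {'Content-Type' => 'text/plain', 'Location' => location}, []]
    else
      @app.call(env)
    end
  end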

This is a trial -- it might match everything. But hopefully, "regular" strings are being encoded as something else (and hence you only get the error for the ASCII-8BIT encoding).
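
If the encoding check does turn out to match everything, a narrower variant is to redirect only when the query string contains bytes outside ASCII. This is a rough sketch, not something tested against BingBot. It assumes the bad requests really carry raw bytes, that PATH_INFO itself is clean, and that the app is mounted at the root (SCRIPT_NAME is ignored). The class name RedirectRawUtf8Query is hypothetical.

```ruby
# Hypothetical middleware: 301-redirects requests whose query string
# contains raw non-ASCII bytes to a %-encoded version of the same URL.
class RedirectRawUtf8Query
  # Anything outside printable ASCII, plus the double quote, gets %-encoded.
  # Existing %XX sequences are left as they are.
  UNSAFE_BYTE = /[^\x21-\x7E]|"/n

  def initialize(app)
    @app = app
  end

  def call(env)
    query = env['QUERY_STRING'].to_s
    # Pure-ASCII query strings pass straight through to the app.
    return @app.call(env) if query.ascii_only?

    fixed = query.dup.force_encoding(Encoding::BINARY)
                 .gsub(UNSAFE_BYTE) { |byte| format('%%%02X', byte.ord) }
    location = "#{env['PATH_INFO']}?#{fixed}"
    [301, { 'Content-Type' => 'text/plain', 'Location' => location }, ['Moved Permanently']]
  end
end

# Usage with a stand-in app and a hand-built env hash:
app = lambda { |_env| [200, { 'Content-Type' => 'text/plain' }, ['ok']] }
env = {
  'PATH_INFO'    => '/search',
  'QUERY_STRING' => "q=author:\"Andr\xC3\xA1s%20Guttman\"".dup.force_encoding(Encoding::BINARY)
}
status, headers, _body = RedirectRawUtf8Query.new(app).call(env)
puts status              # 301
puts headers['Location'] # /search?q=author:%22Andr%C3%A1s%20Guttman%22
```

Because already-escaped requests are pure ASCII, the redirected request passes through on its second trip and there is no redirect loop.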

Per one of the comments, you could also convert instead of unescape and escape:

location = env['QUERY_STRING'].encode('UTF-8')

but you might still need to URI escape the resulting string anyway (not sure, depends on your circumstances).
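
A caution on the convert approach, with a small sketch: calling encode('UTF-8') on an ASCII-8BIT string that contains high bytes raises Encoding::UndefinedConversionError, because Ruby has no mapping from arbitrary binary bytes to characters. If the bytes are already UTF-8, relabelling them with force_encoding and then checking valid_encoding? is the usual alternative. The helper name relabel_as_utf8 is hypothetical.

```ruby
# Hypothetical helper: relabel raw bytes as UTF-8 without changing them.
# Returns nil when the bytes are not valid UTF-8.
def relabel_as_utf8(bytes)
  candidate = bytes.dup.force_encoding(Encoding::UTF_8)
  candidate.valid_encoding? ? candidate : nil
end

raw = "Andr\xC3\xA1s".dup.force_encoding(Encoding::BINARY)

begin
  raw.encode('UTF-8')
rescue Encoding::UndefinedConversionError => e
  puts "encode failed: #{e.class}"
end

puts relabel_as_utf8(raw) # András
```

Bytes in some other encoding (for example a lone Latin-1 \xE1) fail the valid_encoding? check and come back as nil, which is a second possible way to spot malformed input.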

tilthouse
  • Thanks for the try. The question is how can I *detect* the presence of these strings (so that I can redirect to BingBot with a correctly escaped URL). – Tom Harrison Jun 25 '13 at 18:07
  • We'll give it a try in the next day or two and let you know. Thanks! – Tom Harrison Jun 26 '13 at 04:00
  • Can't you check the encoding to see if it is ASCII-8BIT, convert and send the new URL? – Pedro Nascimento Jun 30 '13 at 06:42
  • Sorry, the test `env['QUERY_STRING'] =~ /\\x/` in a rack middleware class & config as specified did not detect any difference in failing cases as in working cases. I put in this and several variants, and additional tests for encoding, but was still getting the error when BingBot hit us and no clues. Thanks for the attempt. – Tom Harrison Jul 03 '13 at 02:40
  • So, Pedro Nascimento (in the comments above) is right. However, I think that since you actually probably DO want to URI.escape the output you might as well just do that. That happens to make the encoding issue moot (the output will always be UTF-8). – tilthouse Jul 03 '13 at 17:27
  • I'm going to modify code above. Please let me know if it works. Depends on if ALL inputs are ASCII-8BIT, or only the ones with the escaped characters. – tilthouse Jul 03 '13 at 17:28
  • Have you had a chance to try the newly revised version out? – tilthouse Jul 11 '13 at 16:57

Please use CGI::unescapeHTML(string)

mlibby
akbarbin