
I'm pulling data from a remote JSON API at http://hndroidapi.appspot.com/news/format/json/page/?appid=test . The problem I'm running into is that this API appears to be building the JSON without correctly handling UTF-8 encoding (correct me if I'm wrong here). For example, part of the result that currently comes back is

{
"title":"IPad - please don€™t ding while you and I are asleep  ",
"url":"http://modern-products.tumblr.com/post/25384729998/ipad-please-dont-ding-while-you-and-i-are-asleep",
"score":"10 points",
"user":"roee",
"comments":"18 comments",
"time":"1 hour ago",
"item_id":"4128497",
"description":"10 points by roee 1 hour ago  | 18 comments"
}

Notice the don€™t. And that isn't the only type of character it is choking on. Is there anything I can do to convert the data into something clean, given that I don't control the API?

Edit:

Here is how I'm pulling down the JSON:

  hn_url = "http://hndroidapi.appspot.com/news/format/json/page/?appid=test"
  url = URI.parse(hn_url)

  # Attempt to get the json
  req = Net::HTTP::Get.new(hn_url)
  req.add_field('User-Agent', 'Test')
  res = Net::HTTP.start(url.host, url.port) {|http| http.request(req) }
  response = res.body
  if response.nil?
    puts "Bad response when fetching HN json"
    return
  end

  # Attempt to parse the json
  result = JSON.parse(response)
  if result.nil?
    puts "Error parsing HN json"
    return
  end

Edit 2:

Just found the API's GitHub page. Looks like this is an outstanding issue. Still not sure if there are any workarounds I can apply from my end: https://github.com/glebpopov/Hacker-News-Droid-API/issues/4

hodgesmr
  • It looks like the JSON response body you are receiving may include HTML-safe symbols. I don't see any bad characters at a glance, and I do see that the response `Content-Type` header is set to `application/json; charset=utf-8`, which looks correct. How are you getting the response body? I would try examining the response with a browser tool like `Dev HTTP Client` or CURL, and see if what your application is getting differs from the actual response. If so, you may be handling it incorrectly in your code. – fdsaas Jun 18 '12 at 22:27
  • Thanks. I added my code up in the edit. The issue, though, is that they are HTML-safe symbols. But, it shouldn't be a Euro Symbol and a 'tm' symbol. It should be an apostrophe. – hodgesmr Jun 18 '12 at 22:35
  • You can see the exact response in the console by using `puts res.body`. Are you seeing the funky symbols later via the `result` object? – fdsaas Jun 18 '12 at 22:43
  • I'm not seeing the funky symbols, I'm seeing the HTML-safe versions of them. So, in the example above, I'm actually seeing `€™` where I should be seeing an apostrophe. This is in both res.body and later on in the result object. I think this is because the API is not representing the apostrophe correctly. So, I was hoping to compensate somehow. – hodgesmr Jun 18 '12 at 22:51
  • Ah, WTF-8 with HTML entity escapes, haven't seen that one before. I feel your pain. – Lars Haugseth Jun 18 '12 at 22:55

2 Answers


It looks like the JSON response body you are receiving is tagged as US-ASCII instead of UTF-8, because Net::HTTP purposely doesn't force an encoding on the body.

1.9.3p194 :044 > puts res.body.encoding
US-ASCII

In Ruby 1.9.3, you can force the encoding if you know what it's supposed to be. Try this:

response = res.body.force_encoding('UTF-8')

The JSON parser should then handle the UTF-8 the way you want it to.
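As a rough sketch of why this works: force_encoding only changes the encoding label on the string, it does not touch the bytes. The example below simulates the situation without a network call; the helper name parse_utf8_json and the sample payload are made up for illustration and are not part of Net::HTTP or the JSON library.

```ruby
require 'json'

# Relabel a response body as UTF-8 and parse it.
# parse_utf8_json is a hypothetical helper name, not part of Net::HTTP or JSON.
def parse_utf8_json(raw_body)
  # dup so the caller's string keeps its original encoding label
  body = raw_body.dup.force_encoding('UTF-8')
  raise ArgumentError, 'body is not valid UTF-8' unless body.valid_encoding?
  JSON.parse(body)
end

# Simulated body: UTF-8 bytes carrying a US-ASCII label, as in the irb session above.
# The payload is made up for illustration.
raw = "{\"title\":\"don\u2019t ding\"}".dup.force_encoding('US-ASCII')

puts raw.encoding         # US-ASCII
puts raw.valid_encoding?  # false, since the apostrophe's bytes are above 0x7F

result = parse_utf8_json(raw)
puts result['title'].encoding  # UTF-8
```

With a real response you would pass res.body instead of the simulated string. The valid_encoding? check is optional, but it fails early if the server did not actually send UTF-8.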

fdsaas

Using force_encoding seems like the best solution. Following up on Kevin Dickerson's answer, here's an explanation of the weirdness.

Net::HTTP is sort of a mess.

On 1.9.3:

  • If the server sends a chunked response, you'd always get ASCII-8BIT. This seems to take precedence over the other scenarios.
  • If you call http.request with a Get object, you'd get US-ASCII. This method does not do compression for you.
  • If you call http.get, compression is enabled.
    • if the server supports compression, you'd get ASCII-8BIT
    • if the server doesn't send a compressed body, you'd get US-ASCII

You'd get US-ASCII because when Net::HTTP creates the buffer string to receive the response, the string is created in the interpreter's default source-file encoding, which is US-ASCII on 1.9.3. (The net/ source files don't have the magic encoding comment at the top, so they use Ruby's default.)

The decompression produces ASCII-8BIT because it's hardcoded to do that in the get method when decompressing.
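Since the label you get back depends on which of these code paths you hit, one option is to relabel the body yourself no matter what arrives. This is only a sketch: normalize_body is a hypothetical helper, and it assumes the server's Content-Type charset (when present) is trustworthy, falling back to UTF-8 otherwise.

```ruby
# Hypothetical helper: relabel a response body using the charset the server
# declared, falling back to UTF-8 when none is given or the name is unknown.
def normalize_body(body, content_type, default: Encoding::UTF_8)
  # Pull "utf-8" out of e.g. "application/json; charset=utf-8"
  charset = content_type.to_s[/charset=["']?([^;"'\s]+)/i, 1]
  encoding = begin
    charset ? Encoding.find(charset) : default
  rescue ArgumentError
    # Encoding.find raises ArgumentError for names Ruby does not know
    default
  end
  # dup so the original string keeps whatever label Net::HTTP gave it
  body.dup.force_encoding(encoding)
end

# Works the same whether the body arrived as ASCII-8BIT or US-ASCII.
binary = "caf\u00e9".b
puts normalize_body(binary, 'application/json; charset=utf-8').encoding  # UTF-8
```

With a Net::HTTP response this would be called as normalize_body(res.body, res['Content-Type']).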

On 2.0, it seems like you always get UTF-8 back, but this is because UTF-8 is the default source-file encoding there. If you change it via the -K option, the response encoding changes accordingly. Try passing n, e, s, or u to -K.

Kelvin