
I've created a Hive UDF that parses a URL. The URL contains query parameters. When I parse the input in my UDF, however, characters like '=' and '&' are converted to gibberish.

Initially, I relied on Text's toString() method to convert the Hive Text to a Java String, but the characters above came out as gibberish. I then tried new String(bytes, StandardCharsets.UTF_8) on the Text's bytes instead. This worked at first, but then it started producing gibberish as well.

My method is shown below. Any ideas on what I might not be doing right?

public Text evaluate(final Text requestInput, final Text referrerInput) {
    if (requestInput == null || referrerInput == null)
        return null;

    final String request = new String(requestInput.getBytes(), StandardCharsets.UTF_8); // converts '=' and '&' in URL strings to gibberish
    final String referrer = new String(referrerInput.getBytes(), StandardCharsets.UTF_8); // converts '=' and '&' in URL strings to gibberish

}

When I run HQL in Hive:

SELECT get_json_object(json, '$.base.request_url') FROM events

I get this:

GET /api/get_info?id=1465473313746 HTTP/1.1

In my UDF, the toString() method (no additional processing) produces the following output:

GET /api/get_info?id\u003d1465473313746 HTTP/1.1

okello
  • Why not just use the parse_url or parse_url_tuple UDF? Also, you don't say what your UDF is supposed to do. What you are doing wrong is the conversion from Text to String: Text has a .toString() method, and you should use that. Instead, you're getting the bytes from the Text and creating a String with a forced UTF-8 encoding, but the Text may not actually be UTF-8. – Roberto Congiu Jun 19 '16 at 20:40
  • My theory was that the issue I've described above is caused by different character encodings. So, I'm trying to use an approach that allows me to specify the character encoding. I read somewhere that Hive always uses `UTF-8`. That's the reason for my specifying it. I'm selecting a JSON field, which happens to have entries containing web URLs. I want to manipulate this JSON in my UDF. However, on accessing it in my UDF, using `toString()` or the above approach, I get gibberish for `=` and `&` characters. – okello Jun 19 '16 at 21:08
  • There is no way to tell what's wrong without looking at the JSON and how the table was created. For sure, it's not the UDF's fault. Also, you say you get 'gibberish' as output but what looks gibberish to you may actually give a hint of what's wrong so you should show that too. – Roberto Congiu Jun 19 '16 at 21:32
  • Can you please post sample input and output so that I can try to help you? – Ranjith Sekar Jun 20 '16 at 07:22
  • I've updated the question with a sample Hive output and how the same is malformed in my UDF. Specifically, the `=` and `&` are malformed; the rest stays fine. – okello Jun 21 '16 at 08:14
  • I've learned that these are the Unicode escape equivalents of these characters. I'm still researching how to prevent this from happening. In the meantime, any assistance will be appreciated. – okello Jun 21 '16 at 08:18
  • I found a solution. Once I had identified that the characters were being converted to Unicode escapes, the problem became relatively easy, as I could use Apache Commons' StringEscapeUtils to get a clean string. Thanks a lot, though, for your assistance. – okello Jun 21 '16 at 10:49

1 Answer


I learned that the = and & characters were being converted to their Unicode escape sequences (for example, = became \u003d). Why this was happening is still unclear to me. Once I knew that, the fix was easy with Apache Commons' StringEscapeUtils utility:

StringEscapeUtils.unescapeJava(requestInput.toString()) 

solved the issue.
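For anyone who wants to see what that call is undoing, here is a minimal JDK-only sketch of decoding `\uXXXX` sequences. It is not the Commons Lang implementation: the class name `UnicodeEscapes` and the method `unescape` are hypothetical, and it handles only `\uXXXX` escapes (not `\n`, `\t`, or escaped backslashes), so `StringEscapeUtils.unescapeJava` remains the better choice in a real UDF. As for where the escapes come from, one possibility, which is only an assumption since nothing above confirms how the JSON was written, is a serializer with HTML-safe escaping enabled. Gson, for instance, writes `=` and `&` as `\u003d` and `\u0026` by default.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration only; not part of Hive or Commons Lang.
final class UnicodeEscapes {
    // A literal backslash, one or more 'u', then exactly four hex digits.
    private static final Pattern ESCAPE = Pattern.compile("\\\\u+([0-9a-fA-F]{4})");

    // Replaces each \uXXXX sequence with the character it encodes.
    // Everything else in the input is copied through unchanged.
    static String unescape(String input) {
        Matcher m = ESCAPE.matcher(input);
        StringBuilder out = new StringBuilder();
        int last = 0;
        while (m.find()) {
            out.append(input, last, m.start());
            out.append((char) Integer.parseInt(m.group(1), 16));
            last = m.end();
        }
        out.append(input, last, input.length());
        return out.toString();
    }

    public static void main(String[] args) {
        // The escaped value from the question, as the UDF receives it.
        String raw = "GET /api/get_info?id\\u003d1465473313746 HTTP/1.1";
        System.out.println(unescape(raw));
        // prints: GET /api/get_info?id=1465473313746 HTTP/1.1
    }
}
```

If the escapes do come from the producer of the JSON, fixing them there (or parsing the field with a real JSON parser) would avoid the need to unescape inside the UDF at all.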

okello