One of the columns in my file is URL-encoded. I have to decode that column and then perform some operations based on the values inside it. Is there any way I can decode that column in awk?

Walter Tross
MikA

1 Answer

You will have to adapt it to your file format, but the basic principle is this (tested with GNU Awk 3.1.7):

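sh$ echo 'Hello%2C%20world%20%21' | awk '
     {
         # Replace each escape from %20 to %3F with its character
         for (i = 0x20; i < 0x40; ++i) {
             repl = sprintf("%c", i);
             # "&" and "\" are special in a gsub replacement, so escape them
             if ((repl == "&") || (repl == "\\"))
                 repl = "\\" repl;
             gsub(sprintf("%%%02X", i), repl);   # upper-case hex, e.g. %2C
             gsub(sprintf("%%%02x", i), repl);   # lower-case hex, e.g. %2c
         }
         print
     }
 '
Hello, world !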

If you have gawk, you can wrap that in a function (credit to brendanh in a comment below):

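# i and repl are declared as extra parameters so they stay local to the function
function urlDecode(url,    i, repl) {
    for (i = 0x20; i < 0x40; ++i) {
        repl = sprintf("%c", i);
        # "&" and "\" are special in the replacement string, so escape them
        if ((repl == "&") || (repl == "\\")) {
            repl = "\\" repl;
        }
        url = gensub(sprintf("%%%02X", i), repl, "g", url);
        url = gensub(sprintf("%%%02x", i), repl, "g", url);
    }
    return url;
}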
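
The loop above covers only `%20` through `%3F`, and because `%25` (the percent sign itself) is decoded in the middle of the loop, an input such as `%2535` is decoded twice. A possible alternative, offered only as a sketch, scans the string once from left to right and converts each escape through a hex lookup table. The names `urldecode`, `decode` and `hex` are made up for this illustration. It assumes single-byte characters: `LC_ALL=C` is set because gawk in a UTF-8 locale would otherwise treat values above 127 passed to `%c` as code points rather than bytes. It does not turn `+` into a space.

```shell
# Hypothetical helper: decode every %XX escape on stdin in a single pass.
urldecode() {
    LC_ALL=C awk '
    BEGIN {
        # Map each hex digit, in both cases, to its numeric value
        for (i = 0; i < 16; i++) {
            d = substr("0123456789abcdef", i + 1, 1)
            hex[d] = i
            hex[toupper(d)] = i
        }
    }
    function decode(s,    out, code) {
        out = ""
        # Consume the string left to right so decoded text is never rescanned
        while (match(s, /%[0-9A-Fa-f][0-9A-Fa-f]/)) {
            code = hex[substr(s, RSTART + 1, 1)] * 16 + hex[substr(s, RSTART + 2, 1)]
            out = out substr(s, 1, RSTART - 1) sprintf("%c", code)
            s = substr(s, RSTART + 3)
        }
        return out s
    }
    { print decode($0) }'
}
```

With this in place, `echo 'http%3a%2f%2fwww.gazelle.com%2fiphone%2fiphone-3g' | urldecode` prints `http://www.gazelle.com/iphone/iphone-3g`. It makes one `match` call per escape instead of two substitutions per candidate character, and it uses no gawk-only functions.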
Community
Sylvain Leroux
  • My string is like this: 'http%3a%2f%2fwww.gazelle.com%2fiphone%2fiphone-3g'. The above operation couldn't decode this string. :( – MikA Jun 08 '13 at 19:48
  • I used the format '%02X', which only matches percent-encoding written in _uppercase_, like `http%3A%2F...`. I modified the sample code to convert lower-case percent-encoding too. Now it should work with both, at least up to `%40` (the upper limit of the for loop). You might have to adjust that. – Sylvain Leroux Jun 08 '13 at 20:03
  • My string is like this: `1370474740&http%3a%2f%2fwww.xxxx.com%2fiphone%2fiphone-3g&et%3da%26ago%3d212%26ao%3d219%26px%3d73%26av1%3d2%26av2%3dOrganicSearch&13456`. When I use awk like this: `awk 'BEGIN {FS = "&"} {for (i = 0x20; i < 0x40; ++i) gsub(sprintf("%%%02x", i), sprintf("%c", i));print $1,$2,$3}'`, the '%26' (which is '&') is not getting converted. Why? – MikA Jun 08 '13 at 20:33
  • This one was tough! I had forgotten that `&` and `\` have special meaning in the replacement string for `gsub`. It is fixed in the answer (I hope). – Sylvain Leroux Jun 08 '13 at 21:26
  • This worked, but the decoded '&'s were also treated as FS. – MikA Jun 09 '13 at 14:11
  • FWIW a slightly modified gawk-only version of the answer, as a function: ``` function urlDecode(url) { for (i = 0x20; i < 0x40; ++i) { repl = sprintf("%c", i); if ((repl == "&") || (repl == "\\")) { repl = "\\" repl; } url = gensub(sprintf("%%%02X", i), repl, "g", url); url = gensub(sprintf("%%%02x", i), repl, "g", url); } return url; } ``` – brendanh Nov 15 '14 at 23:15
  • @brendanh I took the liberty of adding your function to my answer. If you do not agree with that, please feel free to revert that edit. – Sylvain Leroux Nov 16 '14 at 10:43
  • While this function works, it's quite slow. I found a much, much faster one here: https://github.com/Knorkebrot/werc/blob/master/bin/contrib/urldecode.awk – Ruslan Talpa Jan 14 '15 at 08:24
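
MikA's comments describe a side effect of decoding the whole record: once `$0` is modified, awk splits it again, so every decoded `&` becomes a field separator. One possible way around this, sketched below, is to let awk split on `&` first and then decode copies of the individual fields. The sample line comes from the comment above. The names `decode_fields` and `decode`, the tab-separated output and the choice of fields 2 and 3 are assumptions made for this illustration. The loop is the answer's technique with the range widened to printable ASCII and `%25` handled last, so that a decoded `%` cannot start a new escape.

```shell
# Hypothetical helper: split on "&" first, then decode fields 2 and 3 only.
decode_fields() {
    awk '
    BEGIN { FS = "&"; OFS = "\t" }
    # Decode printable-ASCII escapes in a copy of one field
    function decode(s,    i, repl) {
        for (i = 32; i < 127; i++) {
            if (i == 37) continue            # leave %25 for the end
            repl = sprintf("%c", i)
            # "&" and "\" are special in a gsub replacement, so escape them
            if (repl == "&" || repl == "\\") repl = "\\" repl
            gsub(sprintf("%%%02X", i), repl, s)
            gsub(sprintf("%%%02x", i), repl, s)
        }
        gsub(/%25/, "%", s)
        return s
    }
    # $0 is never modified, so the fields are not split a second time
    { print $1, decode($2), decode($3), $4 }'
}
```

Fed the sample line, this prints four tab-separated columns: `1370474740`, `http://www.xxxx.com/iphone/iphone-3g`, `et=a&ago=212&ao=219&px=73&av1=2&av2=OrganicSearch` and `13456`. The decoded `&` characters stay inside the third column.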