PHP Huffman Decode Algorithm

Question

I applied for a job recently and was sent a HackerRank exam with a couple of questions. One of them was a Huffman decoding algorithm. There is a similar problem available here which explains the formatting a lot better than I can.

The actual task was to take two arguments and return the decoded string.

The first argument is the codes, which is a string array like:

[
    "a      00",
    "b      101",
    "c      0111",
    "[newline]      1001"
]

Each entry has the form: a single character, two tabs, then the Huffman code.

The newline was specified in this format because of the way HackerRank is set up.

The second argument is a string to decode using the codes. For example:

101000111 = bac

This is my solution:

function decode($codes, $encoded) {
    $returnString = '';
    $codeArray = array();

    foreach($codes as $code) {
        sscanf($code, "%s\t\t%s", $letter, $code);
        if ($letter == "[newline]")
            $letter = "\n";
        $codeArray[$code] = $letter;
    }
    print_r($codeArray);

    $numbers = str_split($encoded);
    $searchCode = '';
    foreach ($numbers as $number) {
        $searchCode .= $number;
        if (isset($codeArray[$searchCode])) {
            $returnString .= $codeArray[$searchCode];
            $searchCode = '';
        }
    }

    return $returnString;
}

It passed the two initial tests but there were another five hidden tests which it did not pass and gave no feedback on.

I realize that this solution would fail if the character were whitespace, so I tried a less optimal solution that used substr to get the first character and regex matching to get the code, but this still passed the first two tests and failed the hidden five. I also tried the function on the HackerRank platform with whitespace as input, and the sandboxed environment could not handle it anyway, so I reverted to the above solution as it was more elegant.

I tried the code with special characters, characters from other languages, codes of various sizes and it always returned the desired solution.

I am just frustrated that I could not find the cases that caused this to fail as I found this to be an elegant solution. I would love some feedback both on why this could fail given that there is no white-space and also any feedback on performance increases.

Did they give you any restrictions on what built-in functionality you are allowed to use? If not, I would build one array with the codes, and one with the replacements - and then simply use str_replace, that will do all the work in one go. Only requirement is that you sort the codes by length in descending order first. — CBroe, Sep 21 '17 at 09:36
Any built-in PHP functions are allowed, and that is an interesting solution. I don't think it would work well though, as there could be unintended matches where the numbers match up to a letter in an unintended location. The codes can be variable length, so you couldn't split them up first either. — Kieran, Sep 21 '17 at 09:43
The code looks fine to me. But: (1) Could there be any other such codes like `[newline]`, which you would need to take care of? (2) Are there time limits you have to adhere to (for very large input)? (3) Does the function have to give a specific output when the encoded string is found to be an invalid sequence? — trincot, Sep 21 '17 at 09:46
(4) Could the input string have multibyte characters? If so, you need to use something else than `str_split`, as that will split the string into bytes, not characters. — trincot, Sep 21 '17 at 09:49
Thanks, I was sure it was really nice code and it made me question the environment because it wouldn't support whitespace. I just didn't want to sound crazy blaming the environment. 1) The newline was explicitly mentioned as the only case. 2) There are timeout limitations but it did not reach them, it was a specific failed case. 3) There was no mention of what should happen if it was invalid. 4) That is a possibility, but the idea of Huffman compression is that it is binary, so the code string should be just 0 or 1. — Kieran, Sep 21 '17 at 09:50
Indeed for (4) it would only be an issue in the encoding phase, not decoding. — trincot, Sep 21 '17 at 09:59
_“as there could be unintended matches where the numbers match up to a letter in an unintended location”_ - no, that won’t happen, if you order them by descending key length. This is “built-in” into the algorithm already, that there can be no ambiguity then. — CBroe, Sep 21 '17 at 10:31
@CBroe, I think you are wrong. Take the coding in the question, and input string 10100101. The longest match would be 1001, but it would be wrong to replace it. The only correct decoding is 101-00-101, not 10-1001-01. — trincot, Sep 21 '17 at 11:14
Back to (4) again: are you sure that the codes are always expressed in "0" and "1"? Could it be "x" and "y" instead? Could it be multibyte characters? — trincot, Sep 21 '17 at 11:19
@trincot: _“Take the coding in the question, and input string 10100101”_ - but the encoded string in this example wasn’t this one, but `101000111`. You can not just arbitrarily change that. What codes you get with Huffman, depends on the input data to encode that you fed in to begin with. — CBroe, Sep 21 '17 at 11:19
Sure, I was referring to the question's encoding *scheme*, not to the input value for `$encoded`. The test suite would of course present other values for `$encoded`, and would only provide values that were the result of an encoding. But my example shows that you cannot just start replacing substrings in the middle of the encoded input. You must start from the left, working to the right. — trincot, Sep 21 '17 at 11:23
@trincot regarding 4, that could well be the case, it gave the examples as binary but the cases where it was failing were all hidden, so I had no visibility about why, so that could be the case. — Kieran, Sep 21 '17 at 11:31
@CBroe, here is a case that definitely shows you cannot start with replacing the longest substring. Take encoding scheme: `a=01,b=10,c=110`. This means the string `ab` would be encoded to `0110`. If you would decode it by first replacing the longest string, you would identify a `c` in there, which obviously is not correct. — trincot, Sep 21 '17 at 11:37
@Kieran, you could cover the potential multibyte issue by replacing the `str_split` assignment with `= preg_split('//u', $encoded, -1, PREG_SPLIT_NO_EMPTY);` — trincot, Sep 21 '17 at 11:42
Your concern about whitespace is valid as it would immediately match with `\t` in the `sscanf` pattern. I would still suggest to circumvent that and use this instead: `list($letter, $code) = explode("\t\t", $code);` — trincot, Sep 21 '17 at 12:00
@trincot again, this is not how it works. You don’t _start_ with a given encoding scheme - the scheme is created _based on_ the specific content that you are trying to encode. — CBroe, Sep 21 '17 at 12:17
@CBroe, yes, but in the OP's case, one *does* get a given encoding scheme. The scheme might have been created based on unknown input -- we don't know. Fact is that we get the scheme, and that the algorithm must be able to process that scheme, for any given encoded string. — trincot, Sep 21 '17 at 12:22
@trincot I do really like your suggestion of `list($letter, $code) = explode("\t\t", $code);`. I will make a mental note of that; it's a pretty nice way of splitting the code and would have been that slight bit more reliable. — Kieran, Sep 21 '17 at 12:27
@trincot I think you got this the wrong way around here. It’s not that you have been given a scheme as some sort of “key”, and now use that to encode arbitrary texts. No, the scheme came into being _because_ the encoding algorithm was applied to a _specific input text_. Your example, _"a=01,b=10,c=110. This means the string ab would be encoded to 0110"_ is simply wrong, in that it is not even applicable - if one was to go and encode `ab` in Huffman, there would not be a `c` in the resulting scheme. — CBroe, Sep 21 '17 at 12:58
@CBroe, you seem to assume the encoded input string was the basis for creating the encoding scheme. I don't think you can assume that. Nothing is said about which input led to the scheme. And for the missing `c` argument, you can extend the input string with an encoded `c`: `abc` is encoded as 0110110. The problem remains the same. Anyway, I rest my case. The comments section is not intended for prolonged discussions like this. — trincot, Sep 21 '17 at 13:05
@trincot Yes, you can assume that, because that is how Huffman works - you create the scheme based on your input text. An integral part of Huffman is that the input text is analyzed to get the number of occurrences of the individual letters (or words, depending on what you want to apply it to), so that those that occur the most can get the shortest codes assigned - remember, Huffman is not an encryption algorithm, but a lossless compression algorithm. […] — CBroe, Sep 21 '17 at 14:03
[…] You are creating a scheme that is specific to the input text. If you have a different input text, you need to first of all perform this first step - counting the “parts” - again. Using the same scheme on a different text would make no sense whatsoever. — CBroe, Sep 21 '17 at 14:03
wondering if there are any updates to this? I encountered the same problem on hackerrank and was clueless to this as well. I had the same solution and only got 2/7 test cases. — gerky, Mar 05 '18 at 03:58
I'm thinking if there are any limitations to the algorithm above that I missed.. — gerky, Mar 05 '18 at 04:07
@maru if you use `var_dump` in place of `print_r`, you will see that the array keys become a mix of strings and numbers (if a code starts with 0 or is longer than 10 digits, it remains a string as parsed, otherwise it becomes a number). While I have not managed to breed codes which would cause erroneous/ambiguous behavior because of this, it might be a direction to explore. — tevemadar, Mar 11 '18 at 13:27
Did the company/hackerrank mention if the test cases had any issues? — Avi_B, Apr 07 '18 at 01:09
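
The `str_replace` disagreement in the comments can be checked directly. The following is a minimal sketch, assuming the prefix code `a=01, b=10, c=110` from trincot's comment; the helper names `decodeByReplace` and `decodeLeftToRight` are hypothetical and exist only for this comparison.

```php
<?php
// Longest-code-first replacement, as proposed in the comments.
function decodeByReplace(array $pairs, string $encoded): string
{
    // Sort the [code, symbol] pairs so that the longest code comes first.
    usort($pairs, function (array $x, array $y): int {
        return strlen($y[0]) <=> strlen($x[0]);
    });
    // str_replace applies the search strings one after another over the whole subject.
    return str_replace(array_column($pairs, 0), array_column($pairs, 1), $encoded);
}

// Left-to-right matching, the approach used in the question.
function decodeLeftToRight(array $pairs, string $encoded): string
{
    $table = array_column($pairs, 1, 0); // code => symbol
    $out = '';
    $buffer = '';
    foreach (str_split($encoded) as $bit) {
        $buffer .= $bit;
        if (isset($table[$buffer])) {
            $out .= $table[$buffer];
            $buffer = '';
        }
    }
    return $out;
}

$pairs = [['01', 'a'], ['10', 'b'], ['110', 'c']];

echo decodeByReplace($pairs, '0110'), PHP_EOL;   // 0c
echo decodeLeftToRight($pairs, '0110'), PHP_EOL; // ab
```

With this scheme the replacement approach finds `110` in the middle of `0110` and never recovers, while the left-to-right scan yields `ab`.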

score 1 · Answer 1 · answered Mar 06 '18 at 04:33

Your basic approach is sound. Since a Huffman code is a prefix code, i.e. no code is a prefix of another, the first match your search finds must be the intended code. The second half of your code would work with any proper Huffman code and any message encoded using it.
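
To illustrate the prefix property, here is a minimal sketch of the left-to-right matching. The `decodeGreedy` name is hypothetical, and the table is the question's example written out by hand rather than parsed from the input lines.

```php
<?php
// $table maps each code string to its decoded character.
function decodeGreedy(array $table, string $encoded): string
{
    $out = '';
    $buffer = '';
    foreach (str_split($encoded) as $bit) {
        $buffer .= $bit;
        // In a prefix code, the first code that matches is the only possible one.
        if (isset($table[$buffer])) {
            $out .= $table[$buffer];
            $buffer = '';
        }
    }
    return $out;
}

$table = ['00' => 'a', '101' => 'b', '0111' => 'c', '1001' => "\n"];

echo decodeGreedy($table, '101000111'), PHP_EOL; // bac
echo decodeGreedy($table, '10100101'), PHP_EOL;  // bab
```

The second call uses the string discussed in the comments: scanning from the left gives 101-00-101, never 10-1001-01.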

Some comments. First, the example you provide is not a Huffman code, since the prefixes 010, 0110, 1000, and 11 are not present. Huffman codes are complete, whereas this prefix code is not.
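
One way to check completeness is the Kraft sum: for a prefix code, the sum of 2^(-length) over all codes equals 1 exactly when the code is complete. The sketch below assumes binary codes, and the `kraftSum` helper is a hypothetical name.

```php
<?php
// Returns the sum of 2^(-length) over all code strings.
function kraftSum(array $codes): float
{
    $sum = 0.0;
    foreach ($codes as $code) {
        $sum += 2 ** (-strlen($code));
    }
    return $sum;
}

// The codes from the question.
$example = ['00', '101', '0111', '1001'];
// The same codes plus the four missing prefixes named above.
$completed = array_merge($example, ['010', '0110', '1000', '11']);

echo kraftSum($example), PHP_EOL;         // 0.5
var_dump(kraftSum($completed) === 1.0);   // bool(true)
```

A sum below 1 means some bit sequences can never be decoded, which is the situation in the example from the question.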

This brings up a second issue, which is that you do not detect this error. You should be checking whether $searchCode is empty after the end of your loop. If it is not, then either the code was not complete or the message ended in the middle of a code. Either way, the message is corrupt with respect to the provided prefix code. Did the question specify what to do with errors?
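
A minimal sketch of that check follows. The question does not say what should happen on corrupt input, so throwing an exception is purely an assumption, and `decodeStrict` is a hypothetical name.

```php
<?php
// Decodes left to right and rejects input that leaves undecoded bits behind.
function decodeStrict(array $table, string $encoded): string
{
    $out = '';
    $buffer = '';
    $length = strlen($encoded);
    for ($i = 0; $i < $length; $i++) {
        $buffer .= $encoded[$i];
        if (isset($table[$buffer])) {
            $out .= $table[$buffer];
            $buffer = '';
        }
    }
    // Leftover bits mean an incomplete code or a message cut off mid-code.
    if ($buffer !== '') {
        throw new UnexpectedValueException("Undecodable trailing bits: $buffer");
    }
    return $out;
}

$table = ['00' => 'a', '101' => 'b', '0111' => 'c', '1001' => "\n"];

echo decodeStrict($table, '101000111'), PHP_EOL; // bac

try {
    decodeStrict($table, '1011'); // "101" decodes to b, then "1" is left over
} catch (UnexpectedValueException $e) {
    echo $e->getMessage(), PHP_EOL; // Undecodable trailing bits: 1
}
```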

The only real issue I would expect with this code is that you did not decode the code description generally enough. Did the question say there were always two tabs, or did you conclude that? Perhaps it was just any amount of spaces and tabs. Were there other character encodings you needed to convert, like [newline]? I presume you did in fact need to convert them, if one of the examples that worked contained one. Did it? Otherwise, maybe you weren't supposed to convert.
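
A more tolerant way to read the code lines might look like the sketch below. It assumes each line is one symbol (possibly a whitespace character), then any run of spaces or tabs, then a code containing no whitespace. The `buildTable` name is hypothetical, and whether `[newline]` should be converted at all is unknown.

```php
<?php
// Builds a code => symbol table from lines like "a\t\t00" or "[newline]\t\t1001".
function buildTable(array $lines): array
{
    $table = [];
    foreach ($lines as $line) {
        // Lazy (.+?) takes at least one character, so a symbol that is itself
        // a space or tab is still captured before the separator.
        if (!preg_match('/^(.+?)\s+(\S+)\s*$/s', $line, $m)) {
            throw new InvalidArgumentException("Unparseable code line: $line");
        }
        $symbol = $m[1] === '[newline]' ? "\n" : $m[1];
        $table[$m[2]] = $symbol;
    }
    return $table;
}

$table = buildTable([
    "a\t\t00",
    "b   101",           // spaces instead of tabs
    " \t\t11",           // the symbol is a space
    "[newline]\t\t1001",
]);

var_dump($table['11'] === ' ');     // bool(true)
var_dump($table['1001'] === "\n");  // bool(true)
```

Unlike the `sscanf` pattern in the question, this does not lose a symbol that is itself whitespace.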

score 0 · Answer 2 · answered Nov 27 '18 at 12:47

I had the same question in a coding challenge, with one modification: the input was a List such as (a 111101, b 110010, [newline] 111111 ....)

I took a different approach to solving it, using a HashMap, but I too had only the 2 sample test cases pass.

Below is my code:

public static String decode(List<String> codes, String encoded) {
    String result = "";
    String buildvalue = "";
    HashMap<String, String> codeMap = new HashMap<String, String>();
    for (int i = 0; i < codes.size(); i++) {
        String S = codes.get(i);
        String[] splitedData = S.split("\\s+");
        String value = splitedData[0];
        String key = splitedData[1].trim();
        codeMap.put(key, value);
    }
    for (int j = 0; j < encoded.length(); j++) {
        buildvalue += Character.toString(encoded.charAt(j));
        if (codeMap.containsKey(buildvalue)) {
            if (codeMap.get(buildvalue).contains("[newline]")) {
                result += "\n";
                buildvalue = "";
            } else {
                result += codeMap.get(buildvalue);
                buildvalue = "";
            }
        }
    }
    return result;
}
