Why does Octave's huffmandeco throw an error about a too-large index type?

Question

I've got a small MATLAB script that I'm trying to understand. It doesn't do very much: it only reads text from a file, then encodes and decodes it with the Huffman functions. But it throws an error while decoding:

"error: out of memory or dimension too large for Octave's index type
error: called from huffmandeco>dict2tree at line 95 column 19"

I don't know why, because I stepped through it in the debugger and didn't see any unusually large index.

I added the part which calculates p from the input text.

%text is a random input text file in ASCII

%calculate the relative frequency of every Symbol
for i=0:127
    nlet=length(find(text==i));
    p(i+1)=nlet/length(text);
end
symb = 0:127;
dict = huffmandict(symb,p); % Create dictionary
compdata = huffmanenco(fdata,dict); % Encode the data
dsig = huffmandeco(compdata,dict); % Decode the Huffman code

I can only use Octave, not MATLAB, so I don't know whether this is an unexpected error. I use Octave version 6.2.0 on Windows 10. I also tried the build for large data, but it didn't change anything.
Does anyone know what causes the error in this context?

EDIT: I debugged the code again. In the function huffmandeco I found the following function:

function tree = dict2tree (dict)

  L = length (dict);
  lengths = zeros (1, L);

  ## the depth of the tree is limited by the maximum word length.
  for i = 1:L
    lengths(i) = length (dict{i});
  endfor
  m = max (lengths);

  tree = zeros (1, 2^(m+1)-1)-1;

  for i = 1:L
    pointer = 1;
    word    = dict{i};
    for bit = word
      pointer = 2 * pointer + bit;
    endfor
    tree(pointer) = i;
  endfor

endfunction

The maximum length m in this case is 82. So the function calculates:
tree = zeros (1, 2^(82+1)-1)-1.
So it's obvious why the error complains about a dimension too large for the index type: the requested array would have 2^83 - 1 elements.
But there must be a solution, or some other error, because the code has been tested before.
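The size arithmetic above can be checked directly. This is only a minimal sketch: it assumes the longest codeword is 82 bits, as reported, and uses Octave's built-in `sizemax`; the variable names are illustrative.

```octave
% Longest codeword length reported in the question.
m = 82;

% Number of elements dict2tree asks zeros() to allocate.
n_elements = 2^(m+1) - 1;

% Largest array size this Octave build allows.
limit = double (sizemax ());

printf ("requested: %g elements\n", n_elements);
printf ("limit:     %g elements\n", limit);

% True whenever the request cannot be satisfied by the index type.
too_big = n_elements > limit;
```

Even with 64-bit indexing the request is far beyond the limit, which is why switching to the large-array build does not help.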

I guess it maybe caused by `p`, which you do not show. This is why a [mcve] is important — Ander Biguri, Apr 15 '21 at 10:46
Please include which version of octave you are using and which operating system you are running on. Check out this older question with similar problem: https://stackoverflow.com/questions/45881343/octave-out-of-memory-or-dimension-too-large-for-octaves-index-type/45882743 — Nick J, Apr 15 '21 at 13:03
I know the download page includes a version for windows compile with large array support and 64bit indexing. Perhaps that would solve your problem? See https://octave.org/download#ms-windows — Nick J, Apr 15 '21 at 13:11
What is the minimum size for `text` to produce the error? What is the size of `compdata` after encoding? I see that the documentation states that the "signal set must strictly belong in the range [1,N] with N = length (dict)" and your signal includes 0, but I don't know if that would cause the error you're seeing. — beaker, Apr 15 '21 at 15:04
I've set text to a string with 20 ASCII symbols. The error occurs anyway. But compdata is very large: its length is 1022 bits when I input the 140-bit string. I don't know what that means. I also set symb to symb=1:128, but the error is the same — newOne, Apr 15 '21 at 15:12
It seems that `huffmandict` does not like zero-probability symbols. It's creating extremely long codes (>100 bits), which causes the decoding to choke. The comments say that `huffmandict` doesn't assign a codeword to zero-probability symbols, but it does. The workaround is to restrict the symbols to the unique symbols actually present in the signal. — beaker, Apr 15 '21 at 20:08
@newOne the huffman encoding/decoding functions seem to have a slightly different interface in octave vs matlab, even though they both fundamentally do the same thing. See https://stackoverflow.com/q/66929744/4183191 for an example where code works in matlab but needs slight modification in octave (note that the octave code itself would be compatible with matlab, but not the other way round). You may simply be bumping into one such edge case. — Tasos Papastylianou, Apr 16 '21 at 11:59
@newOne tbh, this seems to be a bug with the huffmandeco.m implementation in the communications package. I would encourage you to check the bug tracker for an existing bug report, and file one if there isn't any. Occasionally things have been fixed in 'dev' but not yet released ... you could try the dev version and see if this works better. — Tasos Papastylianou, Apr 16 '21 at 12:43
The current implementation is from a commit in 2011 https://sourceforge.net/p/octave/communications/ci/83aeb09e7255c3953ac11b56299dab426831b419/ If you open a bug, you could mention that to speed up the process. I'm happy to open the bug if you'd like me to. — Tasos Papastylianou, Apr 16 '21 at 12:51
@TasosPapastylianou I didn't revert any tags, but I don't understand why the MATLAB tag would be helpful to this question at all. Tags are intended to indicate what knowledge is necessary to understand and answer the question. If you understand Octave and Huffman Coding, there's no need to have any specific experience with MATLAB. Tags are also used to search existing questions for answers. Anybody who found this while searching for Huffman Coding in MATLAB would be sorely disappointed to find this question specific to the Octave implementation. — beaker, Apr 16 '21 at 13:54
@TasosPapastylianou you can open the bug if you want. I am a beginner with all of this and don't really know how to do that. I think I found it on myself and learned a lot with it. And I'm really happy that it works now. Thank you for your help! — newOne, Apr 16 '21 at 14:34
@beaker this is not a "huffman on octave exclusively" question. This is explicitly a "huffman works on matlab not on octave, why?" question. It explicitly requests the expertise of people who are familiar / have access to both languages for testing, on functionality that, at least superficially, appears to belong to the common subset of both languages. Before checking there is no way to know if this is due to a bug on the Octave or Matlab side of things, or simply reflects entirely different APIs (which is of sufficient interest to a future reader interested in _both_ systems, like OP here). — Tasos Papastylianou, Apr 17 '21 at 10:43
@beaker Also, more generally, people seem too trigger-happy removing the 'matlab' tag from questions which more often than not would totally benefit from a matlab perspective. E.g. writing "endfor" instead of "end" triggers immediate removal, but typically the bug has nothing to do with it. To me, given a matlab tag, the useful thing to do would simply be to tell the user they should use "end" if aiming for the common M/O subset. This vigilante behaviour just seems odd to me, to say the least. I haven't seen a zealot adding octave tags to compatible matlab questions with the same ferocity yet. — Tasos Papastylianou, Apr 17 '21 at 10:51
@TasosPapastylianou You make a good point in this case about the OP not knowing beforehand whether this was a MATLAB or an Octave problem, but when you say there are questions that would benefit from a MATLAB perspective, you kind of lose me. I would argue that nearly all of the MATLAB users that answer questions on SO are at least aware of Octave. They are free to follow the Octave tag if they choose. It seems that you are advocating adding tags for visibility only, which is not what tags should be used for, and also why you don't see people *adding* the Octave tag to random MATLAB questions. — beaker, Apr 17 '21 at 15:16
@beaker I'm not 'advocating' anything, I just find the behaviour frustrating. At best it's unnecessary, and often enough it is downright annoying, when the double tag actually makes sense. Tags are not supposed to enforce some sort of 'tribal' allegiance, they are there to categorise things appropriately. People going round removing matlab tags from otherwise valid matlab/octave questions seems to me like a really weird thing to do. Especially in questions that specifically refer to both environments explicitly. — Tasos Papastylianou, Apr 17 '21 at 17:33
Bug submitted to the Octave Bug Tracker: [Bug #60409](https://savannah.gnu.org/bugs/index.php?60409) — Tasos Papastylianou, Apr 17 '21 at 18:03
@newOne incidentally, there's been a small surge of huffman questions recently in SO ... do you mind if I ask how you ended up working with these functions? I'm trying to figure out if there's a new nanodegree course using huffman encoding on Matlab/Octave or something :) — Tasos Papastylianou, Apr 17 '21 at 18:06
@TasosPapastylianou actually we get the script from our prof to demonstrate how the Huffman functions work. And I ran straight into the problem. I don't know any nanodegree course, I'm studying communication engineering :) — newOne, Apr 18 '21 at 17:01
@TasosPapastylianou Regarding your bug report, the root cause is **not** the memory allocation in `huffmandeco`, it is the fact that `huffmandict` is generating invalid codes when there are zero-probability symbols. You have to ask yourself why there are 82-bit codes generated for 7-bit ASCII values. — beaker, May 18 '21 at 18:43
@beaker that's a very good point ... you should comment this on the bug report! (no account needed) — Tasos Papastylianou, May 18 '21 at 21:32
@TasosPapastylianou Actually, I have to take that back. They could easily compress the max length for dictionaries in which there are multiple zero-probability symbols, but there will always be some cases in which Huffman coding gives you really long code words. That said, I don't find it surprising that trying to allocate an array with 2^82 elements raises an error when `sizemax()` is 2^63. — beaker, May 19 '21 at 14:44

score 1 · Accepted Answer · answered Apr 15 '21 at 20:46

I haven't weeded through the code enough to know why yet, but huffmandict is not ignoring zero-probability symbols the way it claims to. Nor have I been able to find a bug report on Savannah, but again I haven't searched thoroughly.

A workaround is to limit the symbol list and their probabilities to only the symbols that actually occur. Using containers.Map would be ideal, but in Octave you can do that with a couple of the outputs from unique:

% Create a symbol table of the unique characters in the input string
% and the indices into the table for each character in the string.
[symbols, ~, inds] = unique(textstr);
inds = inds.';   % just make it easier to read

For the string

textstr = 'Random String Input.';

the result is:

>> symbols
symbols =  .IRSadgimnoprtu
>> inds
inds =
 Columns 1 through 19:
    4    6   11    7   12   10    1    5   15   14    9   11    8    1    3   11   13   16   15
 Column 20:
    2

So the first symbol in the input string is symbols(4), the second is symbols(6), and so on.

From there, you just use symbols and inds to create the dictionary and encode/decode the signal. Here's a quick demo script:

textstr = 'Random String Input.';
fprintf("Starting string: %s\n", textstr);

% Create a symbol table of the unique characters in the input string
% and the indices into the table for each character in the string.
[symbols, ~, inds] = unique(textstr);
inds = inds.';   % just make it easier to read

% Calculate the frequency of each symbol in table
% max(inds) == numel(symbols)
p = histc(inds, 1:max(inds))/numel(inds);

dict = huffmandict(symbols, p);
compdata = huffmanenco(inds, dict);
dsig = huffmandeco(compdata, dict);

fprintf("Decoded string: %s\n", symbols(dsig));

And the output:

Starting string: Random String Input.
Decoded string: Random String Input.

To encode strings other than the original input string, you would have to map the characters to symbol indices (ensuring that all symbols in the string are actually present in the symbol table, obviously):

>> [~, s_idx] = ismember('trogdor', symbols)
s_idx =
   15   14   12    8    7   12   14

>> compdata = huffmanenco(s_idx, dict);
>> dsig = huffmandeco(compdata, dict);
>> fprintf("Decoded string: %s\n", symbols(dsig));
Decoded string: trogdor
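The same idea can be applied to the original `0:127` frequency table from the question. The following is only a sketch under assumptions: the names `symb_used`, `p_used` and `sig` are made up for illustration, and it uses core functions only, so it stops just before the `huffmandict` call.

```octave
% Input text as numeric ASCII codes (values in 0..127).
text = double ('Random String Input.');

% Relative frequency of every possible 7-bit symbol.
symb = 0:127;
p = histc (text, symb) / numel (text);

% Drop the symbols that never occur, so no zero-probability
% entries reach the dictionary builder.
keep = p > 0;
symb_used = symb(keep);
p_used = p(keep);

% Map each character to its 1-based position in the reduced table.
% This replaces any manual shift between 0-based ASCII codes and
% the 1-based signal values the encoder expects.
[~, sig] = ismember (text, symb_used);

% Round trip through the table without any Huffman coding.
restored = char (symb_used(sig));
```

From here, `p_used` and `sig` would take the place of `p` and `inds` in the demo script above, and `char (symb_used(dsig))` would recover the text after decoding.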

Thank you for your solution. I will try this. But you brought up the right idea. I searched the code of huffmandict for the reason why the description isn't correct, and I found a commented-out part that filtered out the characters with a probability of zero. I have no idea why it was commented out... Now it works! I only added a correction for the ASCII symbols, whose numbering begins at 0, while Octave expects it to begin at 1. — newOne, Apr 16 '21 at 13:47
If you get an ASCII character of 0 in a text string, I would be very very surprised, but I suppose it's possible. — beaker, Apr 16 '21 at 13:49
No, I don't get a character of 0. The dict is created with the characters as an index. And because of the numbering, the characters' probabilities are shifted by one. So e.g. the space ended up with a probability of zero, and the decoded text didn't have any spaces. I hope I'm getting it right, but it works only with the shift and reshift. — newOne, Apr 16 '21 at 14:20
