
I'm experimenting with compression algorithms and ran into an issue where my decoding algorithm would give the expected result when the encoded input was short, but once the input reached a certain level of complexity it would start returning garbage.

Through some debugging I identified that the issue was caused by keeping track of the current node of my tree traversal in a "normal" (by-value) variable:

auto currentNode = root;

After changing this to keep track of the current node using a pointer, the issue was resolved:

const TreeNode* currentNode = &root;

Alternatively, replacing:

currentNode = currentNode.children[bit];

with:

auto cache = currentNode.children[bit];
currentNode = cache;

also resolved the issue.

What I wasn't able to do (even with some help from someone else) was identify the cause of the undefined behaviour. Everything points to the assignment here:

currentNode = currentNode.children[bit];

but that's all we could find.

What's the reason behind the undefined behaviour?


Code:

#include <vector>
#include <string>
#include <iostream>

struct TreeNode {
    char symbol; // Only relevant to leaf nodes
    std::vector<TreeNode> children; // Leaf nodes have 0 children, all other nodes have exactly 2

    TreeNode(unsigned char symbol) : symbol(symbol), children({}) { }
    TreeNode(unsigned char symbol, TreeNode left, TreeNode right) : symbol(symbol), children({ left, right }) { }
};

/// <summary>
/// Decodes provided `input` data up to `size` using the tree rooted at `root` to decode
/// </summary>
/// <param name="input">Encoded data</param>
/// <param name="root">Root node of decoding tree</param>
/// <param name="size">Size of unencoded data</param>
/// <returns>Unencoded data</returns>
std::vector<unsigned char> DecodeWithVars(const std::vector<unsigned char>& input, const TreeNode& root, int size) {
    std::vector<unsigned char> output = {};
    auto currentNode = root;

    for (auto& c : input) {
        for (int i = 0; i <= 7; i++) {
            int bit = (c >> (7 - i)) & 1;   // Iterating over each bit of each character in `input`

            currentNode = currentNode.children[bit];
            if (currentNode.children.size() == 0) {
                output.push_back(currentNode.symbol);
                currentNode = root;
                if (output.size() == size) {
                    return output;
                }
            }
        }
    }
    return output;
}

/// <summary>
/// Decodes provided `input` data up to `size` using the tree rooted at `root` to decode
/// Different from DecodeWithVars in that it uses a pointer to keep track of current tree node
/// </summary>
/// <param name="input">Encoded data</param>
/// <param name="root">Root node of decoding tree</param>
/// <param name="size">Size of unencoded data</param>
/// <returns>Unencoded data</returns>
std::vector<unsigned char> DecodeWithPointers(const std::vector<unsigned char>& input, const TreeNode& root, int size) {
    std::vector<unsigned char> output = {};
    const TreeNode* currentNode = &root;

    for (auto& c : input) {
        for (int i = 0; i <= 7; i++) {
            int bit = (c >> (7 - i)) & 1;   // Iterating over each bit of each character in `input`

            currentNode = &(*currentNode).children[bit];
            if ((*currentNode).children.size() == 0) {
                output.push_back((*currentNode).symbol);
                currentNode = &root;
                if (output.size() == size) {
                    return output;
                }
            }
        }
    }
    return output;
}


int main()
{
    std::string unencodedText = "AAAAAAAAAAAAAAABBBBBBBC,.,.,.,.,.,.CCCCCDDDDDDEEEEE";
    std::vector<unsigned char> data = { 0,0,0,1,36,146,78,235,174,186,235,155,109,201,36,159,255,192 };

    TreeNode tree = TreeNode('*',
                             TreeNode('*',
                                      TreeNode('A'),
                                      TreeNode('*',
                                               TreeNode('B'),
                                               TreeNode('C')
                                      )
                             ),
                             TreeNode('*',
                                      TreeNode('*',
                                               TreeNode('D'),
                                               TreeNode(',')
                                      ),
                                      TreeNode('*',
                                               TreeNode('.'),
                                               TreeNode('E')
                                      )
                             )
    );

    auto decodedFromPointers = DecodeWithPointers(data, tree, unencodedText.size());
    std::string strFromPointers(decodedFromPointers.begin(), decodedFromPointers.end());

    auto decodedFromVars = DecodeWithVars(data, tree, unencodedText.size());
    std::string strFromVars(decodedFromVars.begin(), decodedFromVars.end());


    std::cout << strFromPointers << "\n";
    std::cout << strFromVars << "\n";
    return 0;
}

For reference, tree represents the following tree:

[image of the decoding tree: internal nodes are '*'; the left subtree holds the leaves A, B and C, the right subtree holds D, ',', '.' and E]

Using MSVC (Microsoft (R) C/C++ Optimizing Compiler Version 19.29.30138 for x64) I get the following output using either C++17 or C++20:

AAAAAAAAAAAAAAABBBBBBBC,.,.,.,.,.,.CCCCCDDDDDDEEEEE
AAAAAAAAAAAAAAA*ADDDDDDE*.,.,.,.,.,.*,,,,.*ADDDDEEE

GCC (C++20, run on Coliru) gave:

AAAAAAAAAAAAAAABBBBBBBC,.,.,.,.,.,.CCCCCDDDDDDEEEEE
AAAAAAAAAAAAAAA�������C,.,.,.,.,.,.CCCCCDDDDDDEEEEE

Clang (C++17, also on Coliru) gave the same result:

AAAAAAAAAAAAAAABBBBBBBC,.,.,.,.,.,.CCCCCDDDDDDEEEEE
AAAAAAAAAAAAAAA�������C,.,.,.,.,.,.CCCCCDDDDDDEEEEE

– Nick is tired
  • my guess is that `[bit]` goes out of range, but debug builds would normally pick that up. Try `children.at(bit)` – pm100 Jan 31 '22 at 00:50
  • @pm100 Same result in all 3 compilers, runs fine and same outputs. Because of the nature of the tree all nodes have either 0 or 2 children, the root node is guaranteed to have 2, and after doing any traversal the first step is checking if the new node has 0 children (and if so, reverting back to the root of the tree). I don't believe it would be possible for `[bit]` to go out of range. – Nick is tired Jan 31 '22 at 00:53
  • ok, you have undefined behavior somewhere. Probably not in this code (you can't reason about UB). Time to break out Valgrind – pm100 Jan 31 '22 at 00:57
  • With `auto current …` the code copies nodes every which way. I don’t know if that’s causing the problem, but it’s horribly inefficient and if the code needs to modify a node it will modify whatever copy it currently has and not the original node. – Pete Becker Jan 31 '22 at 01:24
  • @PeteBecker Yes, that's right, but there is no need for nodes to be modified, and while it's not efficient, the fact that the nodes are copied is partly what is causing my confusion (I can't see how there could be some issue caused by, for example, destruction on assignment, which was my friend's first thought). I am actively using the pointer version as it is clearly the more efficient, but still wanted to ask about the UB. – Nick is tired Jan 31 '22 at 01:36
  • I reduced your issue to a minimal reproducible example: https://godbolt.org/z/c9Y6on1dE If this does indeed seem to be the same issue you are seeing, you might want to add it to your question. – user17732522 Jan 31 '22 at 01:58
  • @user17732522 My friend also managed to produce the `terminate called after throwing an instance of 'std::bad_array_new_length'` error in their testing (GCC 11.2), but they went to bed and I was unable to reproduce and hence verify that. In MSVC the code that you've provided builds and runs successfully. They also at some point got a `terminate called after throwing an instance of 'std::bad_alloc'` when trying clang, but again, I don't know how unfortunately. – Nick is tired Jan 31 '22 at 02:01
  • @Nick The point is that the decoding doesn't even seem to be relevant. What do you get if you simply copy the tree after creating it with `auto new_tree = tree; new_tree = new_tree.children.at(0);` and then try to output the children's symbols of `new_tree`? I suspect you will also get garbage. – user17732522 Jan 31 '22 at 02:08
  • @user17732522 That was actually similar to the first thing that I tested, I output both the current children's symbols (both `'*'`) copied `children[0]` and then its symbol was `'A'`, which... clearly made no sense at all. It was never _total_ garbage (read: it was always something in the tree), but it was clearly not correct. – Nick is tired Jan 31 '22 at 02:18
  • @Nick And that happened also on MSVC? – user17732522 Jan 31 '22 at 02:19
  • @user17732522 Correct. – Nick is tired Jan 31 '22 at 02:19

1 Answer


The line

currentNode = currentNode.children[bit];

will use the implicitly-defined copy assignment operator of TreeNode, which copy-assigns the members one by one. In particular, it copy-assigns the member currentNode.children[bit].children into currentNode.children.

However, the source of that vector assignment is a subobject of an element of the destination vector. The element it belongs to is destroyed in the process of giving the vector its new value.

I am not exactly sure whether the standard library is required to make such an aliasing assignment work anyway; the standard seems to only require that after the assignment the two sides compare equal, which is not possible in the given situation.
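
To make the aliasing concrete, here is a minimal sketch of the same pattern. It uses a simplified aggregate Node as a stand-in for the question's TreeNode (the names Node and current are illustrative only), and leaves out the decoding logic entirely, because only the assignment from a node's own child matters:

#include <iostream>
#include <vector>

// Simplified stand-in for TreeNode (an aggregate; names are illustrative only).
struct Node {
    char symbol;
    std::vector<Node> children;
};

int main() {
    // A by-value "current" node with two children, like `auto currentNode = root;`.
    Node current{ '*', { Node{ 'A', {} },
                         Node{ '*', { Node{ 'B', {} }, Node{ 'C', {} } } } } };

    // Same pattern as `currentNode = currentNode.children[bit];`.
    // The implicitly-defined copy assignment performs, member by member:
    //     current.symbol   = current.children[1].symbol;
    //     current.children = current.children[1].children;
    // The right-hand vector is a member of current.children[1], i.e. of an
    // element that the vector assignment destroys or overwrites while it is
    // still being read from.
    current = current.children[1];

    // Whatever is printed here (if anything) depends on how the library
    // copes with that aliasing.
    std::cout << current.children.size() << "\n";
    return 0;
}

Both of the workarounds in the question break that aliasing: the cache variable copies the child out into an independent object before assigning it back, and the pointer version never assigns to a node at all, it only repoints currentNode, so the vector being read from is never modified.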

– user17732522