I have a flat representation of a tree, shown in the table below. The unsorted data, held in a std::vector, is:

unsorted vector
(id)    (path)          (fn)    (line)  (extra)
1       /abc/file3.c    foo0    10      1
2       /abc/file3.c    foo0    15      2
3       /abc/file3.c    foo0    20      1
4       /abc/file3.c    foo1    30      1
5       /abc/file3.c    foo1    35      2
6       /abc/file3.c    foo1    40      1
7       /abc/file1.c    foo2    10      1
8       /abc/file1.c    foo2    15      2
9       /abc/file1.c    foo2    20      1
10      /abc/file3.c    baz1    70      1
11      /abc/file3.c    baz1    75      2
12      /abc/file3.c    baz1    80      1
13      /abc/file2.c    bat     10      1
14      /abc/file2.c    bat     15      2
15      /abc/file2.c    bat     17      2
16      /abc/file2.c    bat     20      1
17      /def/file2.c    baz     70      1
18      /def/file2.c    baz     71      1
19      /def/file2.c    baz     72      1
20      /def/file2.c    baz     73      1

The columns represent 'ID', 'path', 'function', 'line number' and 'extra'. In tree form, the data is ordered hierarchically as path->function->lineNumber: each path contains multiple functions, and each function contains multiple lines of interest (probe points).

Each row in this table is represented with this struct:

enum class Type : unsigned {
    One = 1,
    Two = 2
};

struct MyStruct {
    unsigned id;
    std::string filename;
    std::string function;
    unsigned lineNum;
    Type type;
};

This data is sorted using the hierarchy described above, via the following comparator:

// comparator used for sorting: filename, then function, then line, then type
static const auto customComp = [](const auto& lhs, const auto& rhs) {
    return std::tie(lhs.filename, lhs.function, lhs.lineNum, lhs.type) <
           std::tie(rhs.filename, rhs.function, rhs.lineNum, rhs.type);
};

We end up with the correctly ordered vector:

sorted vector
(id)    (path)          (fn)    (line)  (extra)
7       /abc/file1.c    foo2    10      1
8       /abc/file1.c    foo2    15      2
9       /abc/file1.c    foo2    20      1
13      /abc/file2.c    bat     10      1
14      /abc/file2.c    bat     15      2
15      /abc/file2.c    bat     17      2
16      /abc/file2.c    bat     20      1
10      /abc/file3.c    baz1    70      1
11      /abc/file3.c    baz1    75      2
12      /abc/file3.c    baz1    80      1
1       /abc/file3.c    foo0    10      1
2       /abc/file3.c    foo0    15      2
3       /abc/file3.c    foo0    20      1
4       /abc/file3.c    foo1    30      1
5       /abc/file3.c    foo1    35      2
6       /abc/file3.c    foo1    40      1
17      /def/file2.c    baz     70      1
18      /def/file2.c    baz     71      1
19      /def/file2.c    baz     72      1
20      /def/file2.c    baz     73      1

I need to parse this data using the new ranges or range-v3 API to efficiently recreate the tree structure from which the table originated. I specify ranges here firstly because I am learning my way through this complicated API, but also because the API seems to offer a very efficient way of handling large data sets through lazy evaluation.

The following code works (it is also on godbolt), but it feels wrong. I am using a pair of nested ranges chunk_by loops to parse the data, and I need to terminate the loop over each file chunk early with a break.

The main body of the code is here:

// comparator used for sorting: filename, then function, then line, then type
static const auto customComp = [](const auto& lhs, const auto& rhs) {
    return std::tie(lhs.filename, lhs.function, lhs.lineNum, lhs.type) <
           std::tie(rhs.filename, rhs.function, rhs.lineNum, rhs.type);
};

int
main() {
    print("unsorted vector", structs);
    // split the sorted probes into chunks
    actions::sort(structs, customComp);
    const auto outerComp = [](auto&& lhs, auto&& rhs) {
            return lhs.filename == rhs.filename;
        };
    const auto innerComp = [](auto&& lhs, auto&& rhs) {
            return lhs.function == rhs.function;
        };
    print("sorted vector", structs);
    std::cout << std::endl;
    // split sorted list of probes into chunks by filename
    for (const auto& sources : structs | views::chunk_by(outerComp)) {
        auto foo = sources.size();
        for (const auto& next : sources) {
            auto outcomes = 0;
            for (const auto& functions : sources | views::chunk_by(innerComp)) {
                for (const auto& probe : functions) {
                    outcomes += (probe.type == Type::Two) ? 2 : 1;
                    std::cout << std::format("{}\n", probe);
                }
            }
            std::cout << next.filename << " outcomes [" << outcomes << "]\n";
            break;
        }
        std::cout << "\n";
    }
}

Would it be possible to perform the sort and the double chunking in a single for loop? Ideally, I would like to use the composition form of the ranges API to achieve this.
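
As a side note on the loop structure above: the middle `for (const auto& next : sources)` loop with its `break` only serves to read the first element of the chunk, so `sources.front()` would probably do the same job. The sketch below illustrates what the two nested `chunk_by` passes compute. It is written in plain C++17 without the ranges library, so it is an illustration of the grouping, not a ranges-based solution. `Row`, `chunkBy`, and `outcomesPerFile` are hypothetical names introduced here, `Row` is a trimmed-down stand-in for `MyStruct`, and the input is assumed to be sorted already.

```cpp
#include <algorithm>
#include <cassert>
#include <iterator>
#include <string>
#include <utility>
#include <vector>

// Hypothetical, trimmed-down stand-in for MyStruct (type kept as a plain unsigned).
struct Row {
    std::string filename;
    std::string function;
    unsigned lineNum;
    unsigned type;
};

using RowIt = std::vector<Row>::const_iterator;

// Hypothetical helper: split [first, last) into runs of adjacent elements for
// which same(a, b) holds. This is the grouping chunk_by produces, done eagerly.
template <class Same>
std::vector<std::pair<RowIt, RowIt>> chunkBy(RowIt first, RowIt last, Same same) {
    std::vector<std::pair<RowIt, RowIt>> chunks;
    while (first != last) {
        // Locate the first adjacent pair that does NOT belong together.
        RowIt cut = std::adjacent_find(first, last,
            [&](const Row& a, const Row& b) { return !same(a, b); });
        RowIt end = (cut == last) ? last : std::next(cut);
        chunks.emplace_back(first, end);
        first = end;
    }
    return chunks;
}

// One (filename, outcomes) entry per file. Assumes rows are already sorted by
// filename and then by function.
std::vector<std::pair<std::string, int>> outcomesPerFile(const std::vector<Row>& rows) {
    const auto sameFile = [](const Row& a, const Row& b) { return a.filename == b.filename; };
    const auto sameFunc = [](const Row& a, const Row& b) { return a.function == b.function; };

    std::vector<std::pair<std::string, int>> result;
    for (const auto& file : chunkBy(rows.begin(), rows.end(), sameFile)) {
        assert(file.first != file.second);  // a chunk is never empty
        int outcomes = 0;
        for (const auto& func : chunkBy(file.first, file.second, sameFunc)) {
            for (RowIt it = func.first; it != func.second; ++it) {
                outcomes += (it->type == 2) ? 2 : 1;
            }
        }
        // The first element of the chunk supplies the filename; no break needed.
        result.emplace_back(file.first->filename, outcomes);
    }
    return result;
}
```

With the sorted table from the question, `outcomesPerFile` would yield one entry per file, for example 4 for /abc/file1.c (1 + 2 + 1).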

康桓瑋
johnco3
  • If you want to represent a tree structure, it may make more sense to represent it more directly, with something like: `using Fn = std::map; std::multimap records;` so `records` has paths, each of which contains some functions, and each function has a name and some line numbers. With something like this, you should be able to insert each line independently. – Jerry Coffin Jul 17 '23 at 18:23
  • @JerryCoffin Thanks Jerry, that is a good suggestion; however, the main point was that I would like to use ranges, especially if the dataset gets huge. Also, from what I know, ranges have the advantage of lazy evaluation. I'm quite new to ranges and I'm struggling with the API while also appreciating its potential power for large data sets. The multimap would require preallocation, which defeats the purpose of what I am trying to achieve. – johnco3 Jul 17 '23 at 20:04

1 Answer

Given that each record contains all the information necessary to do so, I'd just create something to represent the tree structure, and insert records into it, rather than sort, and then parse out ranges from the sorted records.

#include <iostream>
#include <sstream>
#include <algorithm>
#include <iterator>
#include <map>
#include <string>
#include <tuple>    // std::tie
#include <vector>   // std::vector
// Keep the code self-contained, though in real use you undoubtedly want to 
// read the raw data from a file, or something on that order.
char const *rawData = R"(
1       /abc/file3.c    foo0    10      1
2       /abc/file3.c    foo0    15      2
3       /abc/file3.c    foo0    20      1
4       /abc/file3.c    foo1    30      1
5       /abc/file3.c    foo1    35      2
6       /abc/file3.c    foo1    40      1
7       /abc/file1.c    foo2    10      1
8       /abc/file1.c    foo2    15      2
9       /abc/file1.c    foo2    20      1
10      /abc/file3.c    baz1    70      1
11      /abc/file3.c    baz1    75      2
12      /abc/file3.c    baz1    80      1
13      /abc/file2.c    bat     10      1
14      /abc/file2.c    bat     15      2
15      /abc/file2.c    bat     17      2
16      /abc/file2.c    bat     20      1
17      /def/file2.c    baz     70      1
18      /def/file2.c    baz     71      1
19      /def/file2.c    baz     72      1
20      /def/file2.c    baz     73      1
)";

struct record {
    int id;
    std::string path;
    std::string fn;
    int lineNumber;
    int type;

    bool operator<(record const &rhs) const { 
        return std::tie(path, fn, lineNumber, type) < std::tie(rhs.path, rhs.fn, rhs.lineNumber, rhs.type);
    }

    friend std::istream &operator>>(std::istream &is, record &r) { 
        return is >> r.id >> r.path >> r.fn >> r.lineNumber >> r.type;
    }
    friend std::ostream &operator<<(std::ostream &os, record const &r) { 
        return os << r.id << "\t" << r.path << "\t" << r.fn << "\t" << r.lineNumber << "\t" << r.type;
    }
};

struct Probe {
    int line;
    int type;

    friend std::ostream &operator<<(std::ostream &os, Probe const &p) { 
        return os << "\t\t" << p.line << " " << p.type;
    }
};

class FuncRec { 
    std::vector<Probe> probes;
public:

    void insert(record const &rec) { 
        probes.push_back(Probe{rec.lineNumber, rec.type});
    }

    friend std::ostream &operator<<(std::ostream &os, FuncRec const &f) { 
        for (auto const &p : f.probes) {
            os << p << "\n";
        }
        return os;
    }
};

class FileRec { 
    std::map<std::string, FuncRec> functions;
public:
    void insert(record const &rec) { 
        functions[rec.fn].insert(rec);
    }

    friend std::ostream &operator<<(std::ostream &os, FileRec const &f) {
        // Loop variable renamed so it no longer shadows the FileRec parameter f.
        for (auto const &fn : f.functions) { 
            os << "\t" << fn.first << "\n";
            os << fn.second;
        }
        return os;
    }
};

class Tree {
    std::map<std::string, FileRec> files;

    void insert(record const &rec) { 
        files[rec.path].insert(rec);
    }

public:

    Tree(std::vector<record> const &in) {
        for (auto const &r : in)
            insert(r);
    }

    friend std::ostream &operator<<(std::ostream &os, Tree const &t) {
        for (auto const &f : t.files) { 
            os << f.first << "\n";
            os << f.second;
        }
        return os;
    }
};

int main() {
    std::stringstream infile(rawData);

    std::vector<record> recs { std::istream_iterator<record>(infile), {}};

    Tree tree{recs};

    std::cout << "Tree structure:\n";
    std::cout << tree;

    // In case you also want to show sorted structs:
    std::cout << "\nSorted records:\n";
    std::sort(recs.begin(), recs.end());
    for (auto const &r : recs) {
        std::cout << r << "\n";
    }
    std::cout << "\n";

}

This is probably a bit more elaborate than really needed. For example, FuncRec doesn't really accomplish much. We could just embed the vector of Probes in its parent, but I'm assuming this is a simplified version of something more elaborate, where FuncRec might serve more purpose.
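
The simplification mentioned above, dropping FuncRec and storing the probes directly under the function name, might look like the following minimal sketch. The names `FlatTree` and `buildTree` and the cut-down `Rec` struct are assumptions made for illustration and are not part of the code above; `Probe` has the same shape as the one above.

```cpp
#include <map>
#include <string>
#include <vector>

// Cut-down version of `record` (hypothetical: id and stream operators omitted).
struct Rec {
    std::string path;
    std::string fn;
    int lineNumber;
    int type;
};

// Same shape as the Probe struct in the full program.
struct Probe {
    int line;
    int type;
};

// path -> function name -> probes, kept in insertion order within a function.
using FlatTree = std::map<std::string, std::map<std::string, std::vector<Probe>>>;

FlatTree buildTree(const std::vector<Rec>& recs) {
    FlatTree tree;
    for (const auto& r : recs) {
        // operator[] creates the inner map and the vector on first use.
        tree[r.path][r.fn].push_back(Probe{r.lineNumber, r.type});
    }
    return tree;
}
```

Because std::map keeps its keys ordered, iterating over the result visits paths and function names in sorted order without sorting the input first. Probes within a function stay in insertion order, so they would still need sorting by line number if the input is not already ordered that way.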

Jerry Coffin