Return the sub string

Question

How can I use this for x amounts of numbers ? for my approach, its hard coded to 2 substrings. Also is there a better way with less time complexity? There may be a loop hole over here which needs to be fixed about the number of num im passing as I am not using the um parameter at all.

You cannot edit and remove the question!!! – Jeffrey Oct 02 '19 at 02:36 — Jeffrey, Oct 02 '19 at 02:36

paddy · Accepted Answer · 2019-10-02T01:11:26.507

Your current approach has a few problems, including a hard-coded maximum number of ngrams, and the fixed ngram size. In addition, your short variable names and lack of comments do not help explain the code to whoever is reading it.

A simpler solution is to use a map to count the number of times each ngram occurs, and then find the one with the highest count. That would give rougly N.logN time complexity. Alternatively unordered_map would be closer to linear time complexity.

There will of course be an edge case where more than one ngram occurs the same highest count. You would need to decide which of a variety of strategies should be used to resolve that. In my example, I take advantage of intrinsic ordering of std::map to select the ngram with the lowest sort order. If using unordered_map, you'd need a different strategy for resolving contention in a deterministic way.

#include <algorithm>
#include <iostream>
#include <map>
#include <string>

std::string ngram(const std::string &input, int num)
{
    if (num <= 0 || num > input.size()) return "";

    // Count ngrams of size 'num'
    std::map<std::string, int> ngram_count;
    for(size_t i = 0; i <= input.size() - num; i++)
    {
        ++ngram_count[input.substr(i, num)];
    }

    // Select ngram with highest count
    std::map<std::string, int>::iterator highest = std::max_element(
        ngram_count.begin(), ngram_count.end(),
        [](const std::pair<std::string, int>& a, const std::pair<std::string, int>& b)
        {
            return a.second < b.second;
        });

    // Return ngram with highest count, otherwise empty string
    return highest != ngram_count.end() ? highest->first : "";
}

int main()
{
    std::cout << ngram("engineering", 2) << std::endl;
    std::cout << ngram("engineering", 3) << std::endl;
    return 0;
}

score 0 · Answer 2 · answered Oct 02 '19 at 01:16

I did it a bit different than paddy, so I thought I would post it. Use std::set. He explains the issue so should get the credit for your answer.

struct test {
    test(const std::string& str) :val(str), cnt(0) {}
    test(const test& thet) { *this = thet; }
    std::string val;
    int cnt;
    friend bool operator < (const test& a, const test& b) { return a.val < b.val; }
};
using test_set_type = std::set<test>;

const test ngram(std::string A, int num) {
    test_set_type set;
    for (auto it = A.begin(); it < A.end() - num + 1; ++it)
    {
        auto found = set.find(std::string(it, it + num));
        if (found != set.end())
            ++const_cast<test&>(*found).cnt;
        else
            set.insert(std::string(it, it + num));
    }
    int find = -1;
    test_set_type::iterator high = set.begin();
    for (auto it = set.begin(); it != set.end(); ++it)
        if(it->cnt > find)
            ++find, high= it;
    return *high;
}


int main() {
    int num = 2;
    std::string word("engineering");
    std::cout << ngram(word, num).val << std::endl;
    return 0;
}

Return the sub string

2 Answers2