I have been trying to write C++ code for a suffix trie. I want each node to keep a counter of how often its character or substring appears during construction of the trie. Note that I am working with only four characters: A, C, G and T.

The code below is my attempt, but it is not working correctly:

#include<iostream>
#include <string>
#include <stdio.h>
#include <string.h>
using namespace std;

struct SuffixTreeNode{
    char c;
    struct SuffixTreeNode* one;
    struct SuffixTreeNode* two;
    struct SuffixTreeNode* three;
    struct SuffixTreeNode* four;
    //int count;

};

SuffixTreeNode* CreateNode(char ch){
    SuffixTreeNode* newnode=new SuffixTreeNode();
    newnode->c=ch;
    newnode->one=NULL;
    newnode->two=NULL;
    newnode->three=NULL;
    newnode->four=NULL;
    //count=0;
}   

SuffixTreeNode* Insert(SuffixTreeNode* root,char ch){
    if (root==NULL){
        root=CreateNode(ch);
    }
    else if(ch=='a'){
        root->one=Insert(root->one,ch);
    }
    else if(ch=='c'){
        root->two=Insert(root->two,ch);
    }
    else if(ch=='g'){
        root->three=Insert(root->three,ch);
    }
    else if(ch=='t') {
        root->four=Insert(root->four,ch);
    }

    return root;
}

bool Search(SuffixTreeNode* root, int data){
    if(root==NULL) return false;
    else if (root->c==data) return true;
    else if (root->c=='a')return Search(root->one,data);
    else if (root->c=='c')return Search(root->two,data);
    else if (root->c=='g')return Search(root->three,data);
    else return Search(root->four,data);
}

int main(){
    SuffixTreeNode* root=NULL;
    char str;
    root=Insert(root,'a');
    root=Insert(root,'c');
    root=Insert(root,'c');
    root=Insert(root,'t');
    root=Insert(root,'a');
    root=Insert(root,'g');
    cout<<"Enter character to be searched\n";
    cin>>str;

    if(Search(root,str)==true)cout<<"Found\n";
    else cout<<"Not found\n";
}
perfecto
  • And the C tag just slipped in, right? Don't add tags for unrelated, **different** languages. – too honest for this site Mar 19 '16 at 16:01
  • Frankly, the `c++` tag should be taken down. This isn't c++... Why do you include both the c and the c++ versions of the headers? Also, do you really want c or c++? It begs to use objects. On a more general note, you are missing a question. It is not nice to say "Here's mine broken, debug it", and it is considered off-topic under the clause: "*Questions seeking debugging help ("why isn't this code working?") must include the desired behavior, a specific problem or error and the shortest code necessary to reproduce it in the question itself.*" So, please, help others help you. – luk32 Mar 19 '16 at 18:30
  • @luk32 honestly, with `` `` and `cout` it's definitely not C – Christophe Mar 19 '16 at 18:31
  • @perfect It's very interesting to see some genetic DNA/RNA information being processed in C++. However, if you need some help, you should explain what doesn't work here and give us input examples. We don't have a crystal ball for that. – Christophe Mar 19 '16 at 18:33
  • @Christophe Yeah, well, I know it's not legal c. However seeing this my *opinion* is that it's illegal c, even though a c++ compiler might compile it. Honestly. It's way easier to make it proper legal c than c++, that's the reason. – luk32 Mar 19 '16 at 18:34
  • `SuffixTreeNode* CreateNode()` function doesn't return anything. Also, your code is basically keeping count of the characters entered. You don't need a suffix tree for that; just use a `map`. – Anmol Singh Jaggi Mar 19 '16 at 19:09

2 Answers

The problem is that the design of search and insert is flawed: both operate on single characters, whereas a trie should work with strings.

Analysis of the problem

If you print out the trie, you will see that you build a tree that only extends the branch corresponding to the inserted letter. This happens because you insert one letter at a time, but it is not the normal layout of a trie:

[Image: diagram of the resulting trie]

Similarly, when you search for an element, everything is fine if it is the root element. If it is not, your code always descends into the branch corresponding to the current node's letter, and does so recursively, which means it only ever searches the branch corresponding to the root.

First step towards a solution: correct the code

If you want to find any letter in the trie structure, you need to update your search to explore not the branch corresponding to the letter of the current node, but to the letter that is searched:

bool Search(SuffixTreeNode* root, int data){
    if(!root) return false;                          // check for null before dereferencing root
    cout << (char)data<<"=="<<root->c<<"?"<<endl;    // debug trace of each comparison
    if (root->c==data) return true;
    // descend into the branch of the letter being searched, not of the current node
    else if (data=='a')return Search(root->one,data);
    else if (data=='c')return Search(root->two,data);
    else if (data=='g')return Search(root->three,data);
    else return Search(root->four,data);
}

This corrects the code, but not the underlying design.

But further work is needed to correct the design

The design should insert/search a string s. The idea is to compare the current char with s[0] and then recursively insert/search the remainder of the string, s.substr(1).

Christophe
  • Thank you Christophe, that enlightened me a lot. However, to clarify what I am trying to do, since my question was not clear: I am trying to construct a suffix trie and be able to search in it, in C/C++. I am also trying to include counters as I construct the trie, i.e. counters of how frequently a character/substring occurs, for instance if I have my struct as follows: struct SuffixTrieNode{ char c; struct SuffixTreeNode* one; struct SuffixTreeNode* two; struct SuffixTreeNode* three; struct SuffixTreeNode* four; int count; }; – perfecto Mar 21 '16 at 09:46
  • - each node keeps track of its counter, but for instance if we are at node "c" in Christophe's diagram, it means the second c should keep track of how many "cc" there are. I had commented out "count" in my posted program because it was failing to work. And lastly, I don't want the root node to have a character. I am stuck. @luk32 - sorry about that, I am a newbie - thanks for the advice - noted. – perfecto Mar 21 '16 at 09:46
  • Yes, the root node shouldn't hold a character at all, because you start with nothing and, from the first char on, you need to choose a branch. – Christophe Mar 21 '16 at 09:49
  • The picture I drew was for the trie corresponding to the nodes you add in main(). However, this is not a correct trie, as you can't represent "a followed by a followed by g". In your insertion logic, you can only represent "a followed by a, or lonesome g". Once you have solved this, the counting will be a piece of cake – Christophe Mar 21 '16 at 09:52
  • [This video](https://www.youtube.com/watch?v=gfqS5nUH9ew) explains a little bit more tries and a c++ implementation and counts words with a given prefix. – Christophe Mar 21 '16 at 09:56
  • thanks for the video link but the link to the sample code is broken so I came up with this from the video, there are two functions i.e. the insert and the search – perfecto Apr 09 '16 at 09:03
  • Yes, and the insert is for a string, because the trie traversal has to validate a succession of related chars, not individual chars on their own. I found the explanations of the video more interesting than the code itself (although it was on a prefix trie). If you have started with the stringification of your API and you're stuck again, open a new question with the current status (I guess I'll find your current functions again, but at node level rather than trie level) and let me know ;-) – Christophe Apr 09 '16 at 09:14

@Christophe - thanks so much for the video link. However, the link to the sample code is broken, so I came up with this from the video. There are two functions, insert and search, as below:

void insert(string word)
{
    node* current=head;
    current->prefix_count++;
    for(unsigned int i=0;i<word.length();++i)
    {
        int letter=(int)word[i]-(int)'a';
        if (current->child[letter]==NULL)
            current->child[letter]=new node();
        current->child[letter]->prefix_count++;
        current=current->child[letter];
    }
    current->is_end=true;
}

bool search(string word)
{
    node *current=head;
    for(int i=0;i<word.length();++i)
    {
        if(current->child[((int)word[i]-(int)'a')]==NULL)
            return false;
        current=current->child[((int)word[i]-(int)'a')];
    }
    return current->is_end;
}

Then implemented the main as follows:

int main(){
node* head=NULL;

 string s="abbaa";
 init();
 insert(s);
 if(search("ab")==true) cout<<"Found"<<endl;
 else cout<<"Not found"<<endl;

}

And I am getting the following output: Not found

This is confusing since ab is found in the string s.

And lastly I am trying to understand this line :

int letter=(int)word[i]-(int)'a';

Does this mean we take the ASCII code of the current character and subtract the ASCII code of 'a' from it?

Thank you

perfecto