
I need to write a program that takes every subsequence of a string, from length 3 up to the length of the string, and compares it with the other subsequences of the same length in that string. Let me explain.

The user enters a DNA string, for example "ACTGCGACGGTACGCTTCGACGTAG". We start with n = 3, that is, we take the first three characters of the DNA to compare.

The first three characters are "ACT", and we need to compare them with the other sequences of three: [CTG, TGC, GCG, ... until the end].

If we find another sequence equal to "ACT", we save its position. Here is another example:

DNA: "ACTGCGACGGTACGCTTCGACGTAG", where we find these sequences at the following positions:

  1. ACG: 7 - 12 - 20
  2. CGA: 5 - 18
  3. GAC: 6 - 19
  4. GTA: 10 - 22
  5. CGAC: 5 - 18
  6. GACG: 6 - 19
  7. CGACG: 5 - 18

The number is the start position of the sequence, counting from 1:

ACTGCGACGGTACGCTTCGACGTAG

As you can see, n starts at 3 and is incremented by 1 once the search for n = 3 has finished: the variable goes to n = 4, and so on until n = DNA.size().

My problem is that I have a function that splits the DNA string into short sequences, and I push_back() each one into a vector. From there I can see whether a sequence occurs more than once, but I don't know how to get its position.

I am allowed to use the algorithm library, and I'm sure it has a function that does this, but I don't know that library very well.

Here is my code:

#include <iostream>
#include <string>
#include <vector>
#include <algorithm>

using namespace std;

const string DNA = "ACTGCGACGGTACGCTTCGACGTAG";
const size_t taille = DNA.size();

size_t m = 3;
vector<string> v;

/*
struct DNA{
    const string dna;  // string entered by the user
    size_t taille;  // length of the string
    string chaine;  // string to search for
};
*/

// What kind of struct could I create? To me it seems pointless to use a struct in this program.

bool checkDNA(string &s);
string takeStrings(const string &s,size_t i, size_t m);
void FindSequenceDNA(vector<string>&s,string sq);
size_t incrementValue(size_t &m);



int main(){

    string DNAuser;
    cout << "Introduce the DNA: ";
    cin >> DNAuser;

    bool request;
    cout << boolalpha;
    request = DNAuser.find_first_not_of("AGCT") == string::npos;  // true if only A, G, C, T appear
    cout << request << endl;

    vector<string> vectorSq;
    size_t auxiliar = 0;
    string r;
    size_t ocurrencies = DNA.size()-2;
    cout << "DNA: " << DNA << endl;
    while(auxiliar<ocurrencies){        // Loops over every start position, from the first to the last.
        r = takeStrings(DNA,auxiliar,auxiliar+m);
        auxiliar++;
        if(r.size()==m){
            vectorSq.push_back(r);
        }
    }

    // string res = takeStrings(DNA,0,3);
    // cout << "res: " << res << endl;
    // cout << "Printing vector: " << endl;

    // I just need to find the other, the practice is almost done.

    for(size_t i = 0; i< vectorSq.size(); i++){
        cout << vectorSq[i] << endl;
    }

    return 0;

}


string takeStrings(const string &s,size_t i, size_t m){
    string result;
    size_t aux=i;
    if(s.size()==0){
        cout << "String is empty." << endl;
    }
    else{
        for(;i<s.size()&&i!=m;i++){
            result+=s[i];
            aux++;
        }

    }
    return result;
}

void FindSequenceDNA(vector<string>&s,string sq){
    if(s.size()==0){
        cout << "DNA invalid." << endl;
    }
    else{
        for(size_t i=0;i<s.size();i++){
            if(sq==s[i]){
                cout << "function: " << endl;
                cout << s[i] << endl; // I need to calculate the real position in the string, not in the vector
            }
        }
    }

}

bool checkDNA(string &s){
    if(s.size()<3){
        cout << "DNA invalid" << endl;
        return false;
    }
    // Valid only if every character is one of A, C, G, T.
    for(size_t i=0;i<s.size();i++){
        if(s[i]!='A' && s[i]!='C' && s[i]!='G' && s[i]!='T'){
            return false;
        }
    }
    return true;
}

size_t incrementValue(size_t &m){
    if(m<DNA.size()){
        m++;
    }
    return m;
}
thecatbehindthemask
  • Just to be clear: you want to do one pass searching for 3-letter patterns, then one pass searching for 4-letter patterns, etc., until you do one pass for N-letter patterns where N = sizeof(input) - 1? Oo – Félix Cantournet Mar 25 '15 at 13:15
  • ohgod this is extremely hard to read, could you rephrase your entire question, please? – specializt Mar 25 '15 at 13:16
  • One way would be to use [std::string::find_first_of](http://en.cppreference.com/w/cpp/string/basic_string/find_first_of) after you find the subsequences. In this case however, using KMP/BM will probably be best in terms of the time efficiency. Preprocessing the text is a good idea as you are doing several searches on it afterwards. – Anirudh Ramanathan Mar 25 '15 at 13:18
  • @FélixCantournet Yes, that's it. Sorry for the explanation, but it's hard for me to explain this in English... – thecatbehindthemask Mar 25 '15 at 13:19
  • @AndresDuque Well, there are 2 options here: one is to do N passes. It will be slow. You can use an integer to iterate over the string and keep the index, then save it: look at Mohit Jain's answer. OR you could use a pattern matching library, which will have a much more optimized algorithm. – Félix Cantournet Mar 25 '15 at 13:21
  • @AnirudhRamanathan What is KMP/BM, and can you explain it a bit more? I can find the subsequences, yes, and I can use find, but in the end I can't get the positions because the position in the vector is different. – thecatbehindthemask Mar 25 '15 at 13:22
  • @AndresDuque Can you use the `` header ? – Félix Cantournet Mar 25 '15 at 13:24
  • @AndresDuque [Boyer-Moore](http://www.cs.utexas.edu/users/moore/best-ideas/string-searching/fstrpos-example.html) and [KMP](http://www.cs.utexas.edu/users/moore/best-ideas/string-searching/kpm-example.html). Another good option is [Rabin Karp](http://en.wikipedia.org/wiki/Rabin%E2%80%93Karp_algorithm). – Anirudh Ramanathan Mar 25 '15 at 13:27
  • @AndresDuque I suggest you take a look at this paper for your algorithm : http://www.cs.utexas.edu/users/moore/publications/sustik-moore.pdf It was designed to look for patterns in DNA sequences. – Félix Cantournet Mar 25 '15 at 13:31
  • This looks exactly like what you would do in the preprocessing step of the [Z-algorithm](http://ivanyu.me/blog/2013/10/15/z-algorithm/). You'd then just have to loop over the array and save the indices where the value is 3. – Rerito Mar 25 '15 at 14:35
  • @Rerito Yes! If it isn't the same, it's very similar! – thecatbehindthemask Mar 25 '15 at 14:48
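
The idea raised in the comments above, scanning the string while keeping the index, could be sketched with repeated calls to `std::string::find`. This is only an illustration: the helper name `findAllPositions` is made up for this example, and it reports 1-based positions to match the numbering used in the question.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper (not from the thread): returns the 1-based start
// positions of every occurrence of `pattern` in `text`, overlaps included.
std::vector<std::size_t> findAllPositions(const std::string& text,
                                          const std::string& pattern) {
    std::vector<std::size_t> positions;
    if (pattern.empty()) {
        return positions;  // an empty pattern would match everywhere
    }
    std::size_t pos = text.find(pattern);
    while (pos != std::string::npos) {
        positions.push_back(pos + 1);       // convert 0-based index to 1-based
        pos = text.find(pattern, pos + 1);  // resume one character later
    }
    return positions;
}
```

Calling `findAllPositions("ACTGCGACGGTACGCTTCGACGTAG", "ACG")` yields 7, 12 and 20, as in the question's example. Each call rescans the text, so this is the simple N-pass option rather than an optimized matcher such as KMP or Boyer-Moore.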

2 Answers


How about:

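#include <map>
#include <string>
#include <vector>

// dna is the input std::string; positions stored here are 0-based.
std::map< std::string, std::vector<std::size_t> > msvi;
const std::size_t len = dna.size();
for (std::size_t from = 0; from < len; ++from) {
  // Stop once the substring would run past the end of dna.
  for (std::size_t sz = 3; from + sz <= len; ++sz) {
    msvi[ dna.substr(from, sz) ].push_back(from);
  }
}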

This creates all substrings of length 3 or more and saves their start positions in a map.

Live demo link

To get only the repeated sequences, print only the entries with 2 or more positions.
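
As a rough sketch of that filtering step, the following builds the map and then keeps only the entries with at least two positions. The names `PositionMap` and `repeatedSubstrings` are invented for this example, and positions are stored 1-based (an assumption, chosen to match the question's expected output).

```cpp
#include <cstddef>
#include <map>
#include <string>
#include <vector>

typedef std::map<std::string, std::vector<std::size_t> > PositionMap;

// Hypothetical helper: maps every substring of length >= minLen that occurs
// at least twice to its 1-based start positions.
PositionMap repeatedSubstrings(const std::string& dna, std::size_t minLen = 3) {
    PositionMap all;
    for (std::size_t from = 0; from < dna.size(); ++from) {
        // Only take substrings that fit entirely inside dna.
        for (std::size_t sz = minLen; from + sz <= dna.size(); ++sz) {
            all[dna.substr(from, sz)].push_back(from + 1);
        }
    }
    PositionMap repeated;
    for (PositionMap::const_iterator it = all.begin(); it != all.end(); ++it) {
        if (it->second.size() >= 2) {
            repeated.insert(*it);  // keep only sequences seen more than once
        }
    }
    return repeated;
}
```

On the question's string this keeps seven entries: ACG, CGA, GAC, GTA, CGAC, GACG and CGACG. Note that the map iterates in alphabetical order, not grouped by length, and that storing every substring costs memory quadratic in the string length.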


As you don't want to use std::map, you can construct a trie as shown on this page written in C. Change your tree node to:

struct tree_node {
  vector<int> starts;
  struct tree_node *children[26];  /* A to Z */
};
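
A trie along these lines might look like the sketch below. It is not the code from the linked page: the `DnaTrie` type, its four-letter alphabet (instead of 26 children) and the index-based node storage are all assumptions made for this example, and the positions it records are 0-based.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical trie over the alphabet {A, C, G, T}; nodes live in one vector
// and refer to each other by index, so no manual memory management is needed.
struct DnaTrie {
    struct Node {
        int children[4];                  // index of child node, or -1
        std::vector<std::size_t> starts;  // 0-based start positions
        Node() { children[0] = children[1] = children[2] = children[3] = -1; }
    };
    std::vector<Node> nodes;

    DnaTrie() : nodes(1) {}  // node 0 is the root

    static int code(char c) {
        switch (c) {
            case 'A': return 0;
            case 'C': return 1;
            case 'G': return 2;
            case 'T': return 3;
            default:  return -1;
        }
    }

    // Inserts every substring of dna; the node reached after k characters
    // records the start position of that k-character substring.
    void insertAll(const std::string& dna) {
        for (std::size_t from = 0; from < dna.size(); ++from) {
            int cur = 0;
            for (std::size_t i = from; i < dna.size(); ++i) {
                int c = code(dna[i]);
                if (c < 0) break;  // stop at a non-ACGT character
                if (nodes[cur].children[c] < 0) {
                    nodes[cur].children[c] = static_cast<int>(nodes.size());
                    nodes.push_back(Node());
                }
                cur = nodes[cur].children[c];
                nodes[cur].starts.push_back(from);
            }
        }
    }

    // Returns the start positions recorded for seq (empty if absent).
    std::vector<std::size_t> find(const std::string& seq) const {
        int cur = 0;
        for (std::size_t i = 0; i < seq.size(); ++i) {
            int c = code(seq[i]);
            if (c < 0 || nodes[cur].children[c] < 0) {
                return std::vector<std::size_t>();
            }
            cur = nodes[cur].children[c];
        }
        return nodes[cur].starts;
    }
};
```

After `insertAll` on the question's string, `find("GTA")` returns 9 and 21; add 1 to each to get the numbering used in the question. A node whose `starts` holds two or more entries corresponds to a repeated sequence.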
Mohit Jain

Based on Mohit's answer, but it reuses pointers to possibly get better performance (vs. string.substr).

#include <iostream>
#include <cstring>
#include <vector>
#include <string>

using namespace std;

static const char* DNAdata = "ACTGCGACGGTACGCTTCGACGTAG";
static const size_t len = strlen(DNAdata);

vector< vector< string > > uniqueKeys(len);
vector< vector< vector<size_t> > > locations(len);


void saveInfo(const char* str, size_t n, size_t loc) {
   vector<string>& keys = uniqueKeys[n-1];
   vector<vector<size_t> >& locs = locations[n-1];

   bool found = false;
   for (size_t i=0; i<keys.size(); ++i) {
      if (keys[i] == str) {
         locs[i].push_back(loc);
         found = true;
         break;
      }
   }
   if (!found) {
      vector<size_t> newcont;
      newcont.push_back(loc);
      keys.push_back(str);
      locs.push_back(newcont);
   }
}

void printInfo(const char* str) {
   cout << str << endl;
   size_t len = strlen(str);
   vector<string>& keys = uniqueKeys[len-1];
   vector<vector<size_t> >& locs = locations[len-1];
   for (size_t i=0; i<keys.size(); ++i) {
      if (keys[i] == str) {
         vector<size_t>& l = locs[i];
         vector<size_t>::iterator iter = l.begin();
         for (; iter != l.end(); ++iter) {
            cout << *iter << endl;
         }

         break;
      }
   }
}

int main() {
   char* DNA = new char[len+1];
   strcpy(DNA, DNAdata);
   char* end = DNA+len;
   char* start = DNA;
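   for (size_t n =3; n<=len; ++n) {
      size_t loc = 0;
      char* p = start;
      char* e = p+n;
      while (e <= end) {
         // Temporarily terminate the buffer at e so that p reads as an
         // n-character substring, then restore the saved character.
         char save = *e;
         *e = 0;
         saveInfo(p++, n, loc++);
         *e = save;
         ++e;
      }
   }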
   delete[] DNA;

   printInfo("GTA");
   printInfo("ACTGCGACGGTACGCTTCGACGTA");

   return 0;
}

To print all:

void printAll() {
   for (size_t n=3; n<=len; ++n) {
      cout << "--> " << n << " <--" << endl;
      vector<string>& keys = uniqueKeys[n-1];
      vector<vector<size_t> >& locs = locations[n-1];
      for (size_t i=0; i<keys.size(); ++i) {
         cout << keys[i] << endl;
         vector<size_t>& l = locs[i];
         vector<size_t>::iterator iter = l.begin();
         for (; iter != l.end(); ++iter) {
            cout << *iter << endl;
         }
      }
   }
}
Mustafa Ozturk
  • @AndresDuque updated the code to provide full solution – Mustafa Ozturk Mar 25 '15 at 14:16
  • This code is almost perfect: the info for "GTA" is correct, and I just need to print the info for all sequences and increment the length of the sequences too! But this is very helpful!! Thanks a lot! @MustafaOzturk – thecatbehindthemask Mar 25 '15 at 14:21
  • @AndresDuque what do you mean by "increment the length of the sequences"? The code above goes through all lengths >=3. – Mustafa Ozturk Mar 25 '15 at 14:22
  • You start with n=3, get the subsequences of size 3, and search for them in the DNA. When that is finished, you increment n to n=4 and do the same. I just iterate over a value to get the subsequences, compare them within the string like you do, and get the positions. @MustafaOzturk – thecatbehindthemask Mar 25 '15 at 14:25
  • @AndresDuque small bug in the code, should be `while (e <= end)` – Mustafa Ozturk Mar 25 '15 at 14:30
  • This is perfect!! @MustafaOzturk, thanks a lot!! I forgot to say that I need to save the subsequences in a vector or something so the same search isn't repeated, and to print only the subsequences that occur more than once. The output would have to look like this: ACG: 7 - 12 - 20, CGA: 5 - 18, GAC: 6 - 19, GTA: 10 - 22, CGAC: 5 - 18, GACG: 6 - 19, CGACG: 5 - 18. – thecatbehindthemask Mar 25 '15 at 14:46
  • Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/73788/discussion-between-andresduque-and-mustafa-ozturk). – thecatbehindthemask Mar 25 '15 at 15:53
  • How can I print only the subsequences that have more than one position found? – thecatbehindthemask Mar 25 '15 at 20:51
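
One possible way to handle that last request is sketched below, assuming the keys and positions are stored as parallel vectors in the same shape as `uniqueKeys[n-1]` and `locations[n-1]` in the answer above, with 0-based positions. The function name `formatRepeated` and the "KEY: a - b" line format are assumptions taken from the output requested in the comments, not part of the original answer.

```cpp
#include <cstddef>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper: `keys` and `locs` are parallel vectors holding each
// sequence and its 0-based positions. Returns one formatted line per
// sequence that was found at two or more positions.
std::vector<std::string> formatRepeated(
        const std::vector<std::string>& keys,
        const std::vector<std::vector<std::size_t> >& locs) {
    std::vector<std::string> lines;
    for (std::size_t i = 0; i < keys.size() && i < locs.size(); ++i) {
        if (locs[i].size() < 2) {
            continue;  // found only once: not a repeat
        }
        std::ostringstream line;
        line << keys[i] << ": ";
        for (std::size_t j = 0; j < locs[i].size(); ++j) {
            if (j > 0) line << " - ";
            line << (locs[i][j] + 1);  // shift to 1-based numbering
        }
        lines.push_back(line.str());
    }
    return lines;
}
```

Calling it once per length n, from 3 up to the string length, and printing the returned lines would list the shorter repeated sequences before the longer ones.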