Remove duplicates in string algorithm

Question

My homework is remove duplicates in a random string. My idea is use 2 loops to solve the problem.

1st one will scan every character in the string. 2nd one will check that character is duplicated or not. If so, remove the character.

string content = "Blah blah..."

    for (int i = 0; i < content.size(); ++i) {
            char check = content.at(i);
            for (int j = i + 1; j < content.size() - 1; ++j) {
                if (check == content.at(j)) {
                    content.erase(content.begin()+j);

                }
            }
        }

The problem is it doesn't work. It always removes the wrong character. Seems an indices problem but I don't understand why.

A temporary fix is change content.erase(content.begin()+j); to content.erase( remove(content.begin() + i+1, content.end(), check),content.end());

But I think trigger a "remove by value" scan isn't a nice way. I want to do it with 2 loops or fewer.

Any ideas will be appreciated :)

One simple way is to use the double-loop brute-force variant you have, but copy non-duplicates to a new string. — Some programmer dude, Jun 30 '14 at 17:58
Potential duplicate: http://stackoverflow.com/questions/2286860/removing-duplicates-in-a-string — djikay, Jun 30 '14 at 18:00
@JoachimPileborg: I like your idea, but can you provide some pseudo code? I can't implement it right now, no idea :( — Tiana987642, Jun 30 '14 at 18:09
@djikay:thank for the link, but can you show me what's wrong in my code? I can't understand why the indices are wrong. — Tiana987642, Jun 30 '14 at 18:10
@Tiana987642 See my answer that to understand what is wrong in your code. — Vlad from Moscow, Jun 30 '14 at 20:14

score 5 · Answer 1 · answered Jun 30 '14 at 18:14

5

If use of STL is a possible option, you could use an std::unordered_set to keep the characters seen so far and the erase-remove idiom with std::remove_if, like in the following example:

#include <iostream>
#include <string>
#include <unordered_set>
#include <algorithm>

int main() {
  std::string str("Hello World!");
  std::unordered_set<char> log;
  std::cout << "Before: " << str << std::endl;
  str.erase(std::remove_if(str.begin(), str.end(), [&] (char const c) { return !(log.insert(c).second); }), str.end());
  std::cout << "After:  " << str << std::endl;
}

LIVE DEMO

answered Jun 30 '14 at 18:14

101010

41,839
11
94
168

Nice idea :) Of course use STL is an option, but I want to improve my skill a bit so I try to think an algorithm first, use `` later :) Just my opinion – Tiana987642 Jul 01 '14 at 13:42
1

@Tiana987642 if it's for educational purposes I'm with you 100%. But when you really want to do serious programming, STL is not only highly recommended, but IMHO its use is mandatory. – 101010 Jul 01 '14 at 13:46
Thank, I appreciate your advice :) – Tiana987642 Jul 01 '14 at 14:21

score 5 · Accepted Answer · answered Jun 30 '14 at 19:53

Your loops could look the following way

#include <iostream>
#include <string>

int main() 
{
    std::string s = "Blah blah...";

    std::cout << '\"' << s << '\"' << std::endl;

    for ( std::string::size_type i = 0; i < s.size(); i++ )
    {
        std::string::size_type j = i + 1;
        while ( j < s.size() )
        {
            if ( s[i] == s[j] )
            {
                s.erase( j, 1 );
            }
            else
            {
                ++j;
            }
        }
    }

    std::cout << '\"' << s << '\"' << std::endl;

    return 0;
}

The output is

"Blah blah..."
"Blah b."

There are many other approaches using standard algorithms. For example

#include <iostream>
#include <string>
#include <algorithm>
#include <iterator>

int main() 
{
    std::string s = "Blah blah...";

    std::cout << '\"' << s << '\"' << std::endl;

    auto last = s.end();

    for ( auto first = s.begin(); first != last; ++first )
    {
        last = std::remove( std::next( first ), last, *first );
    }

    s.erase( last, s.end() );

    std::cout << '\"' << s << '\"' << std::endl;

    return 0;
}

The output is the same as for the previous code example

"Blah blah..."
"Blah b."

Thank you for your suggestion :) I learnt a few thing from your code :) If you have free time, please tell me my what's wrong in my code and workaround for it, like you mentioned in your comment :) — Tiana987642, Jul 01 '14 at 14:49
@Tiana987642 I think that it is enough to compare your loops with my loops in my first code example. It is obvious that this loop for (int j = i + 1; j < content.size() - 1; ++j) is invalid. It is not clear why condition content.size() - 1 is used and secondly when a character was deleted you must not increment i. — Vlad from Moscow, Jul 01 '14 at 14:54
My idea is the first loop take a character (let's call it A) , the second will compare A with the rest and delete. So the 2nd will start after the position of A and is ended at the last of the string. At the time I wrote the code I didn't think carefully about this. Just don't want to out of bound. Thank you for point out my mistake :) — Tiana987642, Jul 01 '14 at 15:08

score 3 · Answer 3 · answered Jun 30 '14 at 19:56

I recommend a two pass approach. The first pass identifies the positions of the duplicated characters; the second pass removes them.

I recommend using a std::set and a std::vector<unsigned int>. The vector contains letters that are in the string. The vector contains the positions of the duplicated letters.

The first pass detects if a letter is present in the set. If the letter exists, the position is appended to the vector. Otherwise the letter is inserted into the set.

For the second pass, sort the vector in descending order.
Erase the character at the position in the vector, then remove the position from the vector.

By erasing characters from the end of the string towards the front, the positions of the remaining duplicates won't change when the character is erased from the string.

I ain't forget who helped me :) I already voted up your answer :) I just want to be fair with people, who also answer this question :D — Tiana987642, Jul 01 '14 at 15:36

score 1 · Answer 4 · answered Jun 30 '14 at 18:11

I am not sure that this is what is causing your problem, but another problem that I see with your code is in your second for loop. Your j < content.size() - 1 statement should just be

j < content.size().

The reasoning for this is a little tricky to see at first, but in this case you are not just getting the size of your vector to act as the size, but to act as the ending indices of your string. You are shortening the last indices by one which means you wont hit the last char in your string. I don't know if this will help your initial problem, but who knows?

score 1 · Answer 5 · answered Jun 30 '14 at 18:21

1

Note: Your actual problem is maintaining a proper index to the next element in question:

If you do not erase a character, the next element is at the next position.
If you erase a character, the next element will move into the place of the current position (the position stays the same).

Also: There are more efficient solutions (eg.: utilizing a set)

answered Jun 30 '14 at 18:21

Now I understand. So can you give me a workaround to avoid situation like this? – Tiana987642 Jul 01 '14 at 13:46

xitong · Answer 6 · 2023-04-09T04:23:39.553

1

sort first

then unique move all unique chars to the beginning and return past-the-end iterator.

then erase the useless chars

string digits("1a2b3c3c2b1a");
sort(digits.begin(), digits.end());
digits.erase( unique( digits.begin(), digits.end() ), digits.end() );
cout << digits << endl;

output

123abc

edited Apr 09 '23 at 04:23

answered Apr 09 '23 at 04:13

xitong

352
2
9

Remove duplicates in string algorithm

6 Answers6

LIVE DEMO

Linked