Removing duplicates in a string in Python

Question

What is an efficient algorithm for removing all duplicates in a string?

For example: aaaabbbccdbdbcd

Required result: abcd

Do you need to maintain / impose order on the result? – Rob Fonseca-Ensor Feb 18 '10 at 07:13

score 19 · Accepted Answer · answered Feb 18 '10 at 07:12

Use a hashtable to store the characters discovered so far (O(1) access), then loop through the string. If a character is already in the hashtable, discard it. If it isn't, add it to both the hashtable and the result string.

Overall: O(n) time (and space).

The naive solution is to search for each character in the result string as you process it. That is O(n²).
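
As a rough illustration of the two approaches described above, here is a minimal Python sketch. The function names `dedupe_naive` and `dedupe_hashed` are made up for this example, and Python's built-in `set` stands in for the hashtable.

```python
def dedupe_naive(text: str) -> str:
    """Quadratic in the worst case: each test rescans the result so far."""
    result = ""
    for ch in text:
        if ch not in result:  # linear search through the result string
            result += ch
    return result


def dedupe_hashed(text: str) -> str:
    """O(n) expected: the set answers 'seen before?' in O(1) on average."""
    seen = set()
    kept = []
    for ch in text:
        if ch not in seen:
            seen.add(ch)
            kept.append(ch)
    return "".join(kept)


print(dedupe_hashed("aaaabbbccdbdbcd"))  # abcd
```

Both functions keep the first occurrence of each character, so they return the same string; only the cost of the membership test differs.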

cletus


+1. Or, if they have access to it, HashSet http://msdn.microsoft.com/en-us/library/bb495294.aspx – Adriaan Stander Feb 18 '10 at 07:14
If you have a large string compared to the number of possible character values (e.g. if it is ASCII), you might use an array of bools instead of a hashtable – Ritsaert Hornstra Feb 18 '10 at 08:10
The best case to retrieve a value from hashtable is O(1) and the worst case O(n). The overall worst case complexity for the algorithm is O(n^2). – Thomas Jung Feb 18 '10 at 08:47
That is irrelevant in this case, as by definition of the algorithm you have either 0 or 1 item for each hash key – jk. Feb 18 '10 at 09:07
@jk The hashtable has always 0 or 1 entries for a key. The worst case O(n) is that all n values are in one bucket. – Thomas Jung Feb 18 '10 at 09:47
@Thomas Jung: For this problem computing a Perfect Hashing function is easy (typically the ASCII or at worst the Unicode Code Point value) therefore you perform access in `O(1)`. – Matthieu M. Feb 19 '10 at 15:29
@Matthieu Not Exactly. Suppose you have perfect hash function from char -> 2 byte Int. This is easy. Now your Hashtable size is smaller than 2^16. Say 15. When you enter 2 values it is quite probable that you will have a collision (1/15 for the second value). The index is some complicated version of idx = hash % size. If you want absolutely no collisions you have to create a hashtable of size 2^16. – Thomas Jung Feb 19 '10 at 15:44
@Matthieu I've realized that I've cut corners a bit. You can of course create a perfect hash function for hashtables with size < 2^16. Cuckoo hashing has for example O(1) worst case access complexity but worst case O(n) for puts. I suppose there is no hashtable that has worst case complexity of O(1) for all operations. – Thomas Jung Feb 19 '10 at 17:56
It depends on the input size: if you can manage to have an upper bound for the size of a bucket, then you can always pretend to be `O(1)` even though it could be daunting :x Here it seems easy enough for ASCII characters (256 of them) and of course a bit more difficult if you wish to take all the Unicode code points into account, yet with a sufficiently big bitset you could have good performance without too much memory (server-scale) – Matthieu M. Feb 20 '10 at 12:48

John La Rooy · Answer 2 · 2010-02-25T00:16:03.863

5

In Python

>>> ''.join(set("aaaabbbccdbdbcd"))
'acbd'

If the order needs to be preserved

>>> q="aaaabbbccdbdbcd"                    # this one is not
>>> ''.join(sorted(set(q),key=q.index))    # so efficient
'abcd'

or

>>> S=set()
>>> res=""
>>> for c in "aaaabbbccdbdbcd":
...  if c not in S:
...   res+=c
...   S.add(c)
... 
>>> res
'abcd'

or

>>> S=set()
>>> L=[]
>>> for c in "aaaabbbccdbdbcd":
...  if c not in S:
...   L.append(c)
...   S.add(c)
... 
>>> ''.join(L)
'abcd'

In Python 3.1

>>> from collections import OrderedDict
>>> ''.join(list(OrderedDict((c,0) for c in "aaaabbbccdbdbcd").keys()))
'abcd'
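
On Python 3.7 and later, plain dicts are guaranteed to keep insertion order, so the same idea may be written without `OrderedDict`. A short sketch, with `dedupe_ordered` as a made-up name for illustration:

```python
def dedupe_ordered(text: str) -> str:
    # dict.fromkeys keeps one key per distinct character,
    # in the order each character first appears.
    return "".join(dict.fromkeys(text))


print(dedupe_ordered("aaaabbbccdbdbcd"))  # abcd
```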

edited Feb 25 '10 at 00:16

answered Feb 18 '10 at 07:51

John La Rooy


I knew set would be awesome for this, but I'm new to python and was trying to figure out how to join them while you posted this... Now I know! – Carson Myers Feb 18 '10 at 07:55
@recursive, I added some order preserving options – John La Rooy Feb 25 '10 at 00:17

score 5 · Answer 3 · edited May 23 '17 at 12:01

This is closely related to the question: Detecting repetition with infinite input.

The hashtable approach may not be optimal depending on your input. Hashtables have a certain amount of overhead (buckets, entry objects), which is huge compared to the single char actually stored. (If your target environment is Java it is even worse, as the HashMap is of type Map<Character,?>.) The worst-case runtime for a hashtable access is O(n) due to collisions.

You need only 8 KB to represent all 2-byte Unicode characters in a plain BitSet (2^16 bits = 8192 bytes). This may be optimized if your input character set is more restricted, or by using a compressed BitSet (as long as the BitSet is sparse). The runtime performance also favors a BitSet: every access is O(1).
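
To make the 8 KB figure concrete, here is a hedged Python sketch of the bitmap idea. Python has no built-in BitSet, so a `bytearray` is used as a stand-in; the name `dedupe_bitset` and the choice to reject code points of 2^16 and above are assumptions of this example.

```python
def dedupe_bitset(text: str) -> str:
    """Drop repeated characters using a fixed bitmap, one bit per 16-bit code point."""
    bits = bytearray(65536 // 8)  # 8192 bytes, all zero
    kept = []
    for ch in text:
        cp = ord(ch)
        if cp >= 65536:
            # Assumption of this sketch: only 2-byte code points are handled.
            raise ValueError("code point outside the 16-bit range")
        byte_index, mask = cp >> 3, 1 << (cp & 7)
        if not bits[byte_index] & mask:  # bit clear: first time we see cp
            bits[byte_index] |= mask
            kept.append(ch)
    return "".join(kept)


print(dedupe_bitset("aaaabbbccdbdbcd"))  # abcd
```

The bitmap has the same size whatever the input, and each lookup is a shift and a mask, with no hashing and no collisions.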

Community

answered Feb 18 '10 at 08:28

Thomas Jung


I am afraid to mention that you are mixing (somehow) concepts and implementations. I view the fact that you are using a `BitSet` to implement your own `HashTable` as a proof that the `HashTable` is a perfectly viable solution. – Matthieu M. Feb 19 '10 at 15:34
@Matthieu Using a Hashtable or a BitSet has certain trade-offs. The hashtable works best for small sets of characters. The BitSet works best when the number of characters is large or can be restricted to a known range. A BitSet is not a Hashtable. The Hashtable here is used as a Set, as someone mentioned. The BitSet is used analogously. That you can replace one with the other does not mean that they are equally good solutions. – Thomas Jung Feb 19 '10 at 15:54

Stano · Answer 4 · 2012-07-02T18:38:12.867

PHP algorithm, O(n):

function remove_duplicate_chars($str) {
    if (2 > $len = strlen($str)) {
        return $str;
    }
    $flags = array_fill(0,256,false);
    $flags[ord($str[0])]=true;
    $j = 1;
    for ($i=1; $i<$len; $i++) {
        $ord = ord($str[$i]);
        if (!$flags[$ord]) {
            $str[$j] = $str[$i];
            $j++;
            $flags[$ord] = true;
        }
    }
    if ($j<$i) { //if duplicates removed
        $str = substr($str,0,$j);
    }
    return $str;
}

echo remove_duplicate_chars('aaaabbbccdbdbcd'); // result: 'abcd'

score 2 · Answer 5 · edited Feb 05 '14 at 18:19

You can do this in O(n) only if you use a hash table. The code is given below. Please note: it is assumed that the number of possible characters in the input string is 256, so a plain array indexed by character code serves as the table.

#include <stdio.h>
#include <string.h>

void removeDuplicates(char *str)
{
 int len = strlen(str); //Gets the length of the String
 int count[256] = {0};  //initializes all elements as zero
 int i;
     for(i=0;i<len;i++)
     {
        unsigned char c = (unsigned char)str[i]; //avoids a negative index when char is signed
        count[c]++;
        if(count[c] == 1)
          printf("%c",str[i]);
     }
}

score 2 · Answer 6 · answered Feb 18 '10 at 07:13

Keep an array of 256 "seen" booleans, one for each possible character. Stream your string. If you haven't seen the character before, output it and set the "seen" flag for that character.
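
A minimal Python sketch of this streaming idea, assuming the input is a byte string in a single-byte encoding such as ASCII or Latin-1 (the name `dedupe_bytes` is made up for this example):

```python
def dedupe_bytes(data: bytes) -> bytes:
    """Keep the first occurrence of each byte value, in input order."""
    seen = [False] * 256  # one flag per possible byte value
    out = bytearray()
    for b in data:        # iterating over bytes yields ints in 0..255
        if not seen[b]:
            seen[b] = True
            out.append(b)
    return bytes(out)


print(dedupe_bytes(b"aaaabbbccdbdbcd"))  # b'abcd'
```

With a multi-byte encoding such as UTF-8, deduplicating individual bytes would corrupt characters, so the string would have to be decoded first and a larger table used.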

SPWorley

It has not been stated which encoding is used, though – Feb 18 '10 at 07:19

score 1 · Answer 7 · answered Jul 14 '13 at 20:24

#include <iostream>
#include<string>
using namespace std;
#define MAX_SIZE 256

int main()
{
    bool arr[MAX_SIZE] = {false};

    string s;
    cin>>s;
    int k = 0;

    for(int i = 0; i < s.length(); i++)
    {
        while(arr[s[i]] == true && i < s.length())
        {
            i++;
        }
        if(i < s.length())
        {
            s[k]    = s[i];
            arr[s[k]] = true;
            k++;
        }
    }
    s.resize(k);

    cout << s<< endl; 

    return 0;
}

TheMan


What does the `arr[s[i]] == true` mean? – syntagma Nov 19 '13 at 20:47
Just having arr[s[i]] there would do too. Is that what you meant? – TheMan Nov 19 '13 at 20:54
`arr[s[i]]` is what I'm interested in - do I understand correctly that you use `char` (which is an `int`) to index the array? – syntagma Nov 19 '13 at 20:58
Yes, that's right and it should work. What's the issue with that? – TheMan Nov 23 '13 at 21:21

score 0 · Answer 8 · edited Nov 07 '12 at 07:31

import java.util.HashSet;
import java.util.Set;

public class RemoveDup {

    // Returns the sample string with repeated characters dropped,
    // keeping the first occurrence of each character in order.
    public static String Duplicate()
    {
        Set<Character> h = new HashSet<>();
        String value = "aaaabbbccdbdbcd";
        StringBuilder finalString = new StringBuilder();
        for (int i = 0; i < value.length(); i++)
        {
            // add() returns false when the character is already in the set
            if (h.add(value.charAt(i)))
            {
                finalString.append(value.charAt(i));
            }
        }
        return finalString.toString();
    }

    public static void main(String[] args) {
        System.out.println(Duplicate());
    }
}

score 0 · Answer 9 · answered Mar 29 '13 at 19:06

Get a list of the first 26 prime numbers. Now map each character (a, b, c, d, etc.) to a prime number, either alphabetically (a=2, b=3, c=5, etc.) or by relative abundance, giving the most frequently used letters the smallest primes (e=2, r=3, a=5, etc.). Store that mapping in an integer array int prime[26].

Iterate through all the characters of the string:

int i = 0;
int product = 1;
while (str[i] != '\0') {
   int p = prime[str[i] - 'a'];   // prime mapped to the current character
   if (product % p == 0)
      the character is already present, delete it
   else
      product = product * p;
   i++;
}

This algorithm works in O(n) time with O(1) space. It works well when the string has few distinct characters; otherwise the product will exceed the "int" range, and that case has to be handled properly.
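
The prime-product idea can be sketched in Python, where integers have arbitrary precision and the overflow concern does not arise (though the product is then no longer a fixed-size value). This sketch assumes lowercase a-z input and the alphabetical mapping; `PRIME_OF` and `dedupe_primes` are names invented for the example.

```python
from string import ascii_lowercase

# First 26 primes, one per lowercase letter (a=2, b=3, c=5, ...).
PRIMES = [2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, 41,
          43, 47, 53, 59, 61, 67, 71, 73, 79, 83, 89, 97, 101]
PRIME_OF = dict(zip(ascii_lowercase, PRIMES))


def dedupe_primes(text: str) -> str:
    product = 1  # product of the primes of all characters seen so far
    kept = []
    for ch in text:
        p = PRIME_OF[ch]  # raises KeyError for anything outside a-z
        if product % p != 0:  # p does not divide the product: ch is new
            product *= p
            kept.append(ch)
    return "".join(kept)


print(dedupe_primes("aaaabbbccdbdbcd"))  # abcd
```

The divisibility test works because each character's prime enters the product at most once, so `product % p == 0` holds exactly when the character has already been seen.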

Amgad Fahmi · Answer 10 · 2010-02-18T07:45:26.843

0

  string newString = new string("aaaaabbbbccccdddddd".ToCharArray().Distinct().ToArray());

or

char[] characters = "aaaabbbccddd".ToCharArray();
string result = string.Empty;
foreach (char c in characters)
{
    if (result.IndexOf(c) < 0)
        result += c.ToString();
}

edited Feb 18 '10 at 07:45

answered Feb 18 '10 at 07:13

Amgad Fahmi

O(n^2) isn't very efficient... (once the data set gets big enough). For small strings this is probably faster than a hashset based lookup though – Rob Fonseca-Ensor Feb 18 '10 at 07:14
I agree. What about the new one? – Amgad Fahmi Feb 18 '10 at 07:23
String concatenation in a loop will be slower than the searching of the character within the string... – cjk Feb 18 '10 at 08:53

score 0 · Answer 11 · answered Feb 18 '10 at 08:48

In C++, you'd probably use an std::set:

std::string input("aaaabbbccddd");
std::set<char> unique_chars(input.begin(), input.end());

In theory you could use std::unordered_set instead of std::set, which should give O(N) expected overall complexity (though O(N²) worst case), where this one is O(N lg M) (where N=number of total characters, M=number of unique characters). Unless you have long strings with a lot of unique characters, this version will probably be faster though.

score 0 · Answer 12 · answered Feb 24 '10 at 23:48

You can sort the string and then remove the duplicate characters.

#include <iostream>
#include <algorithm>
#include <string>

int main()
{
    std::string s = "aaaabbbccdbdbcd";

    std::sort(s.begin(), s.end());
    s.erase(std::unique(s.begin(), s.end()), s.end());

    std::cout << s << std::endl;
}
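
For comparison, the same sort-then-unique idea might look like this in Python. It is only a sketch: `dedupe_sorted` is a made-up name, and `itertools.groupby` plays the role of `std::unique` by collapsing runs of equal neighbours.

```python
from itertools import groupby


def dedupe_sorted(text: str) -> str:
    # After sorting, equal characters are adjacent, so taking one key
    # per run leaves each character exactly once.
    return "".join(key for key, _run in groupby(sorted(text)))


print(dedupe_sorted("aaaabbbccdbdbcd"))  # abcd
```

The sort makes this O(n log n), and the result comes out in sorted order rather than in order of first appearance.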

score 0 · Answer 13 · answered Feb 25 '10 at 00:09

This sounds like a perfect use for automata.

jbrennan


score 0 · Answer 14 · answered Feb 25 '10 at 00:26

C++: O(n) time, O(1) space, and the output is sorted.

#include <limits>
#include <string>
#include <vector>

std::string characters = "aaaabbbccddd";
// One flag per possible char value: max - min + 1 entries (256 for an 8-bit char).
const int lo = std::numeric_limits<char>::min();
const int hi = std::numeric_limits<char>::max();
std::vector<bool> seen(hi - lo + 1);

for(std::string::iterator it = characters.begin(), endIt = characters.end(); it != endIt; ++it) {
  seen[(*it) - lo] = true;
}

characters = "";
// Iterate with an int so the loop can include hi without overflowing a char.
for(int ch = lo; ch <= hi; ++ch) {
  if( seen[ch - lo] ) {
    characters += static_cast<char>(ch);
  }
}

JoeG


O(1) space??? I see a vector of bools.. Isn't it the same as an array of bools of 256 characters?? – letsc Aug 10 '11 at 04:58
@smartmuki: It's O(1) space because the size of the `vector` does not vary according to the size of the input - it's 256 bools no matter what the input is. – JoeG Aug 10 '11 at 12:34

score 0 · Answer 15 · answered Nov 28 '14 at 16:11

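#include <iostream>
#include <set>
#include <string>

int main()
{
    std::string s = "aaacabbbccdbdbcd";

    // std::set keeps one copy of each character, in sorted order
    std::set<char> set1;
    set1.insert(s.begin(), s.end());

    for(std::set<char>::iterator it = set1.begin(); it != set1.end(); ++it)
        std::cout << *it;

    return 0;
}

std::set takes O(log n) per insert, so the whole run is O(n log n). The output comes out sorted, not in first-seen order.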

score 0 · Answer 16 · answered Sep 11 '16 at 02:31

O(n) solution:

#include<stdio.h>
#include<string.h>

void removeDuplicates(char *);

void removeDuplicates(char *inp)
{
    /* One bit per possible unsigned char value (256 bits in total).
       A single int has too few bits: 1 << 'a' shifts past its width,
       which is undefined behaviour and can make 'a' and 'A' collide. */
    unsigned char seen[32] = {0};
    int i = 0, out = 0;

    while (inp[i] != '\0')
    {
        unsigned char c = (unsigned char)inp[i];
        if (!(seen[c >> 3] & (1u << (c & 7))))
        {
            seen[c >> 3] |= (unsigned char)(1u << (c & 7));
            inp[out++] = inp[i];    /* keep the first occurrence */
        }
        i++;
    }

    inp[out] = '\0';
}

int main()
{
    char inp[100] = "aaAABCCDdefgccc";
    //char inp[100] = "ccccc";
    //char inp[100] = "\0";

    printf (" INPUT STRING : %s\n", inp);

    removeDuplicates(inp);

    printf (" OUTPUT STRING : %s:\n", inp);
    return 0;
}

score 0 · Answer 17 · answered May 22 '19 at 09:17

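Perhaps built-in Python functions are more efficient than "self-made" ones. Like this:

=====================

NOTE: maintains order

CODE

from functools import reduce  # reduce is no longer a builtin in Python 3

string = "aaabbbccc"

# append y only if it is not already in the accumulated string x
product = reduce((lambda x,y: x if (y in x) else x+y), string, "")

print(product)

OUTPUT

abc

=========================

NOTE: order neglected

CODE

string = "aaabssabcdsdwa"

str_uniq = ''.join(set(string))

print(str_uniq)

OUTPUT

acbdsw (set iteration order is arbitrary, so the letters may come out in a different order)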

score 0 · Answer 18 · answered May 04 '11 at 17:33

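In C, this is how I did it. It is O(n) in time, since there is a single pass over the string:

void remDup(char *str)
{
    int flags[256] = { 0 };

    /* Stop at the terminator instead of calling strlen() on every
       iteration, which would rescan the string each time. */
    for(int i=0; str[i] != '\0'; i++) {
        unsigned char c = (unsigned char)str[i];  /* avoid a negative index when char is signed */
        if( flags[c] == 0 )
            printf("%c", str[i]);

        flags[c] = 1;
    }
}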

score -1 · Answer 19 · answered May 03 '20 at 09:44

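# by using python
def cleantext(word):
    # Base case: nothing left to process.
    if not word:
        return word
    # Keep the first character, drop every later copy of it,
    # then recurse on what remains.
    return word[0] + cleantext(word[1:].replace(word[0], ""))

word = "aaaabbbccdbdbcd"
print(cleantext(word))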

Mousa Mohammed

