I found an algorithm here that removes duplicate characters from a string with O(1) space complexity (SC). However, the algorithm converts the string to a character array, and the size of that array is not constant: it depends on the input size. They claim that it runs in O(1) space. How?

// Function to remove duplicates 
static string removeDuplicatesFromString(string string1) 
{ 

    // keeps track of visited characters 
    int counter = 0; 
    char[] str = string1.ToCharArray(); 
    int i = 0; 
    int size = str.Length; 

    // gets character value 
    int x; 

    // keeps track of length of resultant String 
    int length = 0; 

    while (i < size) { 
        x = str[i] - 97; 

        // check if Xth bit of counter is unset 
        if ((counter & (1 << x)) == 0) { 

            str[length] = (char)('a' + x); 

            // mark current character as visited 
            counter = counter | (1 << x); 

            length++; 
        } 
        i++; 
    } 

    return (new string(str)).Substring(0, length); 
} 

It seems that I don't understand Space Complexity.

Moshi
  • Is this Java? C#? Please tag with an appropriate language. Regarding the algorithm, the character array is not growing during the running of the algorithm, so it takes constant space. Hence, O(1). What part is unclear to you? Are you perhaps confusing space with time complexity? – Cody Gray - on strike May 21 '19 at 19:50
  • @CodyGray actually language doesn't matter here. – Moshi May 21 '19 at 19:54
  • It does; you've posted a large chunk of code in your question. – Cody Gray - on strike May 23 '19 at 20:51

1 Answer

I found an algorithm here that removes duplicate characters from a string with O(1) space complexity (SC). However, the algorithm converts the string to a character array, and the size of that array is not constant: it depends on the input size. They claim that it runs in O(1) space. How?

It does not.

The algorithm takes as its input an arbitrarily sized string drawn from an alphabet of only 26 characters. The output therefore never contains more than 26 characters, so the output array need not be the size of the input.

You are correct to point out that the implementation given on the site allocates O(n) extra space unnecessarily for the char array.

Exercise: Can you fix the char array problem?
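One possible answer to this exercise, as a minimal sketch rather than a definitive fix: since at most 26 distinct letters can ever be emitted, the working buffer can have a fixed size of 26 no matter how long the input is. The names `Dedup` and `RemoveDuplicateLetters` are hypothetical, and the sketch keeps the original algorithm's assumption that the input contains only the letters 'a' to 'z'.

```csharp
using System;

static class Dedup
{
    // Hypothetical helper. Assumes every character of input is in 'a'..'z',
    // exactly as the original algorithm does; other characters are not handled.
    public static string RemoveDuplicateLetters(string input)
    {
        int seen = 0;                   // bit x set => letter 'a' + x was already emitted
        char[] buffer = new char[26];   // fixed size, independent of the input length
        int length = 0;

        // Iterating the string directly avoids copying it into a char array.
        foreach (char c in input)
        {
            int bit = 1 << (c - 'a');
            if ((seen & bit) == 0)
            {
                seen |= bit;            // mark the letter as visited
                buffer[length++] = c;
            }
        }

        // The returned string is the output itself, at most 26 characters long.
        return new string(buffer, 0, length);
    }
}
```

For example, `Dedup.RemoveDuplicateLetters("geeksforgeeks")` yields `"geksfor"`. The extra storage is one `int` and a 26-element buffer, which does not grow with the input.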

Harder Exercise: Can you describe and implement a string data structure that implements the contract of a string efficiently but allows this algorithm to be implemented actually using only O(1) extra space for arbitrary strings?

Better exercise: The fact that we are restricted to an alphabet of 26 characters is what enables the cheesy "let's just use an int as a set of flags" solution. Instead of saying that n is the size of the input string, what if we allow arbitrary sequences of arbitrary values that have an equality relation; can you come up with a solution to this problem that is O(n) in the size of the output sequence, not the input sequence?

That is, can you implement `public static IEnumerable<T> Distinct<T>(this IEnumerable<T> t)` such that the output is deduplicated but otherwise in the same order as the input, using O(n) storage where n is the size of the output sequence?

This is a better exercise because this function is actually implemented in the base class library. It's useful, unlike the toy problem.
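A minimal sketch of one way to approach the better exercise, assuming a hash set is acceptable as the "seen" structure and that `T`'s default equality is the equality relation wanted. The names `SequenceExtensions` and `DistinctInOrder` are hypothetical, chosen so the method does not clash with the library's own `Distinct`; this is not presented as the library's actual implementation, and argument validation is omitted.

```csharp
using System.Collections.Generic;

static class SequenceExtensions
{
    // Hypothetical name; yields each element the first time it is seen,
    // preserving the order of the input sequence.
    public static IEnumerable<T> DistinctInOrder<T>(this IEnumerable<T> source)
    {
        // The set only ever holds elements that have been yielded,
        // so its size is bounded by the size of the output, not the input.
        var seen = new HashSet<T>();
        foreach (T item in source)
        {
            // HashSet<T>.Add returns false when the item is already present.
            if (seen.Add(item))
            {
                yield return item;
            }
        }
    }
}
```

Because the method is an iterator, elements are produced lazily: a caller that stops enumerating early never pays for the rest of the input.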

I note also that the problem statement assumes that there is only one relevant alphabet with lowercase characters, and that there are 26 of them. This assumption is false.

Eric Lippert
  • The existing `String` data type could implement the algorithm in O(1) space if it included a generic static method that accepted an integer `n` and an argument constrained to `IEnumerator` or similar interface, and stored the first `n` values returned from the enumerator into a new string of length `n`. Since no reference to the new string would ever be exposed to any outside code that could modify it after exposure, all immutability guarantees would remain intact. – supercat May 21 '19 at 20:15
  • `Exercise: Can you fix the char array problem?` For this specific algorithm, there is no need to create a character array from the string. If we assume there are only 256 possible characters, we can declare a fixed `char[256] array` and do the operation on that. The algorithm needs only minor changes. – Moshi May 21 '19 at 20:18
  • @Moshii: The concept here is presumably that the input string consists only of 26 characters, so the output is never more than 26 characters. Where does the number 256 come from in your comment? – Eric Lippert May 21 '19 at 20:27
  • @EricLippert I assumed 256 because, if we want the algorithm to support `Extended ASCII, which supports 8 bit values, or 256 characters`, we need only 256 characters. – Moshi May 21 '19 at 20:48
  • @Moshii: Are you a time traveller from the 1970s? :) C# chars are 16 bits; there are 65536 of them. We use Unicode in modern code, not ASCII. – Eric Lippert May 21 '19 at 20:54
  • @EricLippert My bad, so we need `2^16` constant memory which is still fixed sized compared to `.ToCharArray`. – Moshi May 21 '19 at 20:58
  • @Moshii: OK, can you solve the problem on *sequences of 64 bit ints*, where we relax the restriction to being O(n) in the size of the output sequence? This business of "let's just make an array of bools of the number of possible input elements" only works when the number of possible input levels is very small. That's the sort of assumption that you make in a toy problem; it's a better exercise to solve a real-world problem. – Eric Lippert May 21 '19 at 21:24
  • @EricLippert To allocate `2^16` bit memory there is a catch here. If the input length is less than 65536 in an average case, `ToCharArray` is better. – Moshi May 21 '19 at 21:27
  • Is this the answer for the "better exercise"? `var unique = new String("geeksforgeeks".Distinct().ToArray())` – Theodor Zoulias May 21 '19 at 21:30
  • @EricLippert If space complexity O(n), we can use `Dictionary` data structure. Where _64 bit ints_ are stored as string format. Easy. – Moshi May 21 '19 at 21:34
  • @TheodorZoulias: And what if you were the person tasked with implementing `Distinct`? *Someone* had to write that code; it didn't just spring into being on its own. How would you write it? That's the challenge. – Eric Lippert May 22 '19 at 00:05
  • It was a challenge for the first guy who implemented the method. Now it's just a matter of reaching for the [source code](https://referencesource.microsoft.com/system.core/system/linq/Enumerable.cs.html#0ff831f61400afdf), and studying these 3 lines of code in all their glory. :-) – Theodor Zoulias May 22 '19 at 01:09
  • @TheodorZoulias If you look at the code [link](https://referencesource.microsoft.com/#system.core/system/linq/Enumerable.cs,9c10b234c0932864), they initiate a set which incrementally stores all the distinct elements. Though you don't have to care about the algorithm, your memory will be consumed. – Moshi May 22 '19 at 04:07
  • @Moshii the set will store at maximum 65535 chars, which will happen if the text is extremely diverse. In practice the memory consumption will be no more than 100 chars, most of the time. This method still fails for characters outside the [Basic Multilingual Plane](https://en.wikipedia.org/wiki/Plane_(Unicode)#Basic_Multilingual_Plane) though. For example `new String("".Distinct().ToArray())` is messed up. – Theodor Zoulias May 22 '19 at 08:21
  • @TheodorZoulias: I was hoping someone would mention that! – Eric Lippert May 22 '19 at 12:03
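
The Basic Multilingual Plane caveat raised in the comments comes from `char` being a 16-bit UTF-16 code unit: a character outside the BMP is stored as a surrogate pair, and deduplicating individual `char` values can split such a pair. A minimal sketch of one way around this, assuming it is acceptable to deduplicate by text element instead of by `char`, is shown here. The names `TextElementDedup` and `DistinctTextElements` are hypothetical; `StringInfo.GetTextElementEnumerator` is the framework API used to walk the string.

```csharp
using System.Collections.Generic;
using System.Globalization;
using System.Text;

static class TextElementDedup
{
    // Hypothetical helper: removes repeated text elements while keeping
    // first-occurrence order. A surrogate pair is treated as one unit.
    public static string DistinctTextElements(string input)
    {
        var seen = new HashSet<string>();   // ordinal comparison by default
        var result = new StringBuilder();

        TextElementEnumerator elements = StringInfo.GetTextElementEnumerator(input);
        while (elements.MoveNext())
        {
            string element = elements.GetTextElement();
            if (seen.Add(element))
            {
                result.Append(element);
            }
        }
        return result.ToString();
    }
}
```

As with the hash-set approach above, storage grows with the number of distinct elements in the output, not with the length of the input.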