2

I'm writing a function in C that will take a string and remove all characters that are not a lowercase alphabetic character. I have this code written so far:

void strclean(char* str) {
   while (*str) {
      if (!(*str >= 'a' && *str <= 'z')) {
         strcpy(str, str + 1);
         str--;
      }
      str++;
   }
}

When I pass it the string "hello[][]world" the function seems to mostly work, except the output is:

hellowoldd

When I make it print after every line that it enters the if statement, here is the output I receive:

hello][]woldd
hello[]woldd
hello]woldd
hellowoldd

It seems to be really close but I can't understand why it would be producing this output! The weirdest part is I have given the code to two other friends and it works fine on their computers. We are all running the same version of Linux (ubuntu 14.04.3), and are all using gcc to compile.

I'm not sure if there is an issue with the code that would cause inconsistency of the output, or if it's a compiler issue that's creating the problem. Maybe it has to do with strcpy on my machine compared to theirs?

Chris
  • 87
  • 2
  • 7

2 Answers2

10

The strcpy function is not guaranteed to work if the ranges overlap, as they do in your case. From C11 7.24.2.3 The strcpy function /2 (my emphasis):

The strcpy function copies the string pointed to by s2 (including the terminating null character) into the array pointed to by s1. If copying takes place between objects that overlap, the behavior is undefined.

You can use something like memmove, which does work with overlapping ranges, as per C11 7.24.2.2 The memmove function /2:

The memmove function copies n characters from the object pointed to by s2 into the object pointed to by s1. Copying takes place as if the n characters from the object pointed to by s2 are first copied into a temporary array of n characters that does not overlap the objects pointed to by s1 and s2, and then the n characters from the temporary array are copied into the object pointed to by s1.


But there's a better solution that's O(n) rather than O(n2) in time complexity, while still being overlap-safe:

void strclean (char* src) {
    // Run two pointers in parallel.

    char *dst = src;

    // Process every source character.

    while (*src) {
        // Only copy (and update destination pointer) if suitable.
        // Update source pointer always.

        if (islower(*src)) *dst++ = *src;
        src++;
    }

    // Finalise destination string.

    *dst = '\0';
}

You'll notice I'm also using islower() (from ctype.h) to detect lower case alphabetic characters. This is more portable since the C standard does not mandate that the alpha characters have consecutive code points (the digits are the only ones guaranteed to be consecutive).

There's also no separate need to check for isalpha() since, as per C11 7.4.1.2 The isalpha function /2, islower() == true implies isalpha() == true:

The isalpha function tests for any character for which isupper or islower is true, or ...

paxdiablo
  • 854,327
  • 234
  • 1,573
  • 1,953
  • 2
    If `islower` returns "true", isn't it a given that `isalpha` would also return "true"? Meaning you don't need the `isalpha` check? – Some programmer dude Sep 25 '15 at 07:37
  • @Joachim, you're correct since isalpha is true for, at a minimum, everything that `islower` or `isupper` is true for. I'll adjust the answer. – paxdiablo Sep 25 '15 at 07:43
4

from N1256 7.21.2.3 The strcpy function

If copying takes place between objects that overlap, the behavior is undefined.

memmove can be used even if the area is overlapped.

void strclean(char* str) {
   while (*str) {
      if (!islower(*str)) { /* include ctype.h to use islower function */
         memmove(str, str + 1, strlen(str)); /* strlen(str + 1) + 1 (for terminating null character) should be strlen(str) */
      } else {
         str++;
      }
   }
}

Since it's undefined behaviour to subtract from a pointer so that it points to an area before the array, I've also restructured the str manipulation.

paxdiablo
  • 854,327
  • 234
  • 1,573
  • 1,953
MikeCAT
  • 73,922
  • 11
  • 45
  • 70
  • Thank you! This answer is a great concise answer because it hardly changes my code. I didn't know about memmove. I think the person above gave me good improvements to my code, but I liked your answer for it detailing my problem specifically. – Chris Sep 26 '15 at 23:15
  • MikeCAT, as per BLUEPIXY comment on question, it's undefined behaviour to subtract from a pointer that's pointing to the start of an array. You could get rid of that relatively simply by removing the `str--` and putting the `str++` into an `else` block. I'm *assuming* it was also an oversight to leave the `strcpy` in there. I'll make those changes but, if you're unhappy, feel free to improve. Doesn't get rid of the O(n^2) nature but it's still a useful answer for all but strings of *extreme* length (which are few and far between). – paxdiablo Sep 27 '15 at 03:45