Remove white chars between commas, but not between what inside the commas

Question

I'm new to C and learning C90. I'm trying to parse a string into a command, But I have a hard time trying to remove white chars.

My goal is to parse a string like this:

NA ME, NAME   , 123 456, 124   , 14134, 134. 134   ,   1

into this:

NA ME,NAME,123 456,124,14134,134. 134,1

so the white chars that were inside the arguments are still there, but the other white chars are removed.

I thought about using strtok, but I still want to keep the commas, even if there are multiple consecutive commas.

Until now I used:

void removeWhiteChars(char *s)
{
    int i = 0;
    int count = 0;
    int inNum = 0;
    while (s[i])
    {
        if (isdigit(s[i]))
        {
            inNum = 1;
        }
        if (s[i] == ',')
        {
            inNum = 0;
        }
        if (!isspace(s[i]) && !inNum)
            s[count++] = s[i];
        else if (inNum)
        {
            s[count++] = s[i];
        }

        ++i;
    }
    s[count] = '\0'; /* adding NULL-terminate to the string */
}

But it only skips for numbers and does not remove white chars after the number until the comma, and it's quite wrong.

i would appreciate any kind of help, I'm stuck on this one for two days now.

So, in simpler terms, remove all whitespace immediately preceding, or following, a comma. — WhozCraig, May 09 '22 at 18:15
@WhozCraig Yes! Kinda, But keeping the whitespace between the argument chars — Sheilem, May 09 '22 at 18:20
I understand, thus my comment. The spaces you're talking about do not immediately precede, nor follow, a comma. `"A B , C D"` , only the spaces on either side of the comma are to be removed; right? — WhozCraig, May 09 '22 at 18:23
I think you're on the right track. It's basically a collapsing reader/writer algorithm with a little state-machine on whether all whitespace immediately preceding, or following, a comma, is contiguous. — WhozCraig, May 09 '22 at 18:36
I don't know if this would help you, but instead of thinking just "remove the whitespace around the commas", approach this problem by trying replacing all occurrences of the sequence **0 or more spaces, 1 comma, 0 or more spaces** by a single comma might be easier to implement. — Uduru, May 09 '22 at 19:02
@Uduru Interesting approach, I will have to think about this for a little bit — Sheilem, May 09 '22 at 19:07
If you are only interested in commas and whitespace, why do you check for digits? They seem totally irrelevant. — rici, May 09 '22 at 19:49

score 5 · Accepted Answer · answered May 09 '22 at 19:50

You need to do lookaheads whenever you encounter possible skippable whitespace. The function below, every time it sees a space, checks forward if it ends with a comma. Likewise, for every comma, it checks and removes all following spaces.

// Remove elements str[index] to str[index+len] in place
void splice (char * str, int index, int len) {
  while (str[index+len]) {
    str[index] = str[index+len];
    index++;
  }
  str[index] = 0;
}

void removeWhiteChars (char * str) {
  int index=0, seq_len;

  while (str[index]) {
    if (str[index] == ' ') {
      seq_len = 0;

      while (str[index+seq_len] == ' ') seq_len++;

      if (str[index+seq_len] == ',') {
        splice(str, index, seq_len);
      }
    }
    if (str[index] == ',') {
      seq_len = 0;
      while (str[index+seq_len+1] == ' ') seq_len++;

      if (seq_len) {
        splice(str, index+1, seq_len);
      }
    }
    index++;
  }
}

So elegant, After a few comments here I thought about doing it, But didn't knew how to implement it, It's amazing! — Sheilem, May 09 '22 at 21:16

David C. Rankin · Answer 2 · 2022-05-09T23:02:15.563

A short and reliable way to approach any parsing problem is to use a state-loop which is nothing more than a loop over all the characters in your original string where you use one (or more) flag variables to keep track of the state of anything you need to track. In your case here, you need to know the state of whether you are reading post (after) the comma.

This controls how you handle the next character. You will use a simple counter variable to keep track of the number of spaces you have read, and when you encounter the next character, if you are not post-comma, you append that number of spaces to your new string. If you are post-comma, you discard the buffered spaces. (you can use encountering the ',' itself as a flag that need not be kept in a variable).

To remove spaces around the ',' delimiter, you can write a rmdelimws() function that takes the new string to fill and the old string to copy from as arguments and do something similar to:

void rmdelimws (char *newstr, const char *old)
{
  size_t spcount = 0;               /* space count */
  int postcomma = 0;                /* post comma flag */
  
  while (*old) {                    /* loop each char in old */
    if (isspace (*old)) {           /* if space? */
      spcount += 1;                 /* increment space count */
    }
    else if (*old == ',') {         /* if comma? */
      *newstr++ = ',';              /* write to new string */
      spcount = 0;                  /* reset space count */
      postcomma = 1;                /* set post comma flag true */
    }
    else {                          /* normal char? */
      if (!postcomma) {             /* if not 1st char after comma */
        while (spcount--) {         /* append spcount spaces to newstr */
          *newstr++ = ' ';
        }
      }
      spcount = postcomma = 0;      /* reset spcount and postcomma */
      *newstr++ = *old;             /* copy char from old to newstr */
    }
    old++;                          /* increment pointer */
  }
  *newstr = 0;                      /* nul-terminate newstr */
}

(note: updated to affirmatively nul-terminate if newstr wasn't initialized all zero as shown below)

If you want to save the trailing whitespace in the line (e.g. spaces after the ending 1 in your example), you can add the following before nul-terminating the string above:

  if (!postcomma) {                 /* if tailing whitespace wanted */
    while (spcount--) {             /* append spcount spaces to newstr */
      *newstr++ = ' ';
    }
  }

Putting it together is a short example you would have:

#include <stdio.h>
#include <ctype.h>

void rmdelimws (char *newstr, const char *old)
{
  size_t spcount = 0;               /* space count */
  int postcomma = 0;                /* post comma flag */
  
  while (*old) {                    /* loop each char in old */
    if (isspace (*old)) {           /* if space? */
      spcount += 1;                 /* increment space count */
    }
    else if (*old == ',') {         /* if comma? */
      *newstr++ = ',';              /* write to new string */
      spcount = 0;                  /* reset space count */
      postcomma = 1;                /* set post comma flag true */
    }
    else {                          /* normal char? */
      if (!postcomma) {             /* if not 1st char after comma */
        while (spcount--) {         /* append spcount spaces to newstr */
          *newstr++ = ' ';
        }
      }
      spcount = postcomma = 0;      /* reset spcount and postcomma */
      *newstr++ = *old;             /* copy char from old to newstr */
    }
    old++;                          /* increment pointer */
  }
  *newstr = 0;                      /* nul-terminate newstr */
}


int main (void) {
  
  char str[] = "NA ME, NAME   , 123 456, 124   , 14134, 134. 134   ,   1   ",
       newstr[sizeof str] = "";
  
  rmdelimws (newstr, str);
  
  printf ("\"%s\"\n\"%s\"\n", str, newstr);
}

Example Use/Output

$ ./bin/rmdelimws
"NA ME, NAME   , 123 456, 124   , 14134, 134. 134   ,   1   "
"NA ME,NAME,123 456,124,14134,134. 134,1"

Not only are state-loops [machines] fast, but this is a nice way to add features if you decide to expand your grammar. That code is amazing. — Neil, May 11 '22 at 04:19
Thank you, and I agree. If you can boil what you need to do down to tracking the state of a few things in a loop, then toggling an `int` `1` and `0` provides a blisteringly fast way to iterate through the problem. You can include `stdbool.h` and use `bool`, but since the standard makes no guarantee of the `sizeof(bool)`, you aren't guaranteed to save a byte or two over `int`. (basically a taste thing) And from the x86 days when `int` was the equivalent to the register size, the hardware was already optimized for it. — David C. Rankin, May 11 '22 at 05:01

yano · Answer 3 · 2022-05-09T19:56:45.103

Below works, at least for your input string. I make absolutely no claims as to its efficiency or elegance. I did not try to modify s in place, instead wrote to a new string. The algorithm I followed was:

Initialized a startPos to 0.
Loop on s until you find a comma.
Backup from that position until you find the first non-space character.
memcpy from startPos to that position to a new string.
Add a comma to the next position of the new string.
Look forward from comma position until you find the first non-space character, set that to startPos.
Rinse and repeat
At the very end, append the final token with strcat

void removeWhiteChars(char *s)
{
    size_t i = 0;
    size_t len = strlen(s);
    char* newS = calloc(1, len);
    size_t newSIndex = 0;
    size_t startPos = 0;

    while (i<len)
    {
        // find the comma
        if (s[i] == ',')
        {            
            // find the first nonspace char before the comma
            ssize_t before = i-1;
            while (isspace(s[before]))
            {
                before--;
            }
            
            // copy from startPos to before into our new string
            size_t amountToCopy = (before-startPos)+1;
            memcpy(newS+newSIndex, s+startPos, amountToCopy);
            newSIndex += amountToCopy;
            newS[newSIndex++] = ',';

            // update startPos
            startPos = i+1;
            while (isspace(s[startPos]))
            {
                startPos++;
            }
            
            // update i
            i = startPos+1;
        }
        else
        {
            i++;
        }
    }

    // finally tack on the end
    strcat(newS, s+startPos);

    // You can return newS if you're allowed to change your function
    // signature, or strcpy it to s
    printf("%s\n", newS);    
}

I have also only tested it with your input string, it may break for other cases.

Demonstration

Neil · Answer 4 · 2022-05-11T20:58:10.427

You can modify this in place in O(n) using a state machine. In this example, I've used re2c to set-up and keep the state for me.

#include <stdlib.h>
#include <stdio.h>
#include <string.h>

static void lex(char *cursor) {
    char *out = cursor, *open = cursor, *close = 0;
start:
    /*!re2c /* Use "re2c parse.re.c -o parse.c" to get C output file. */
    re2c:define:YYCTYPE = "char";
    re2c:define:YYCURSOR = "cursor";
    re2c:yyfill:enable = 0;
    /* Whitespace. */
    [ \f\n\r\t\v]+ { if(!close) open = cursor; goto start; }
    /* Words. */
    [^, \f\n\r\t\v\x00]+ { close = cursor; goto start; }
    /* Comma: write [open, close) and reset. */
    "," {
        if(close)
            memmove(out, open, close - open), out += close - open, close = 0;
        *(out++) = ',';
        open = cursor;
        goto start;
    }
    /* End of string: write any [open, close). */
    "\x00" {
        if(close)
            memmove(out, open, close - open), out += close - open;
        *(out++) = '\0';
        return;
    }
    */
}

int main(void) {
    char command[]
        = "NA ME, NAME   , 123 456, 124   , 14134, 134. 134   ,   1   ";
    printf("<%s>\n", command);
    lex(command);
    printf("<%s>\n", command);
    return EXIT_SUCCESS;
}

This works by being lazy; that is, differing the writing of the string until we can be sure it's complete, either at a comma or the end of the string. It's quite simple, belonging to a regular language, without lookahead. It preserves whitespace between words that don't have commas between them. It also overwrites the string, so it doesn't use extra space; we can do this because the edits only involve deletion.

score 0 · Answer 5 · answered May 09 '22 at 20:12

Please try this:

void removeWhiteChars(char *s)
{
    int i = 0;
    int count = 0;
    int isSomething = 0;
    while (s[i])
    {
        if (s[i] == ',' && isSomething == 0)
            isSomething = 2;
        else if (s[i] == ',' && isSomething == 1)
            isSomething = 2;
        else if (s[i] == ',' && isSomething == 2)
        {
            s[count++] = ',';
            s[count++] = s[i];
            isSomething = 0;
        }
        else if (isspace(s[i]) && isSomething == 0)
            isSomething = 1;
        else if (isspace(s[i]) && isSomething == 1)
            isSomething = 1;
        else if (isspace(s[i]) && isSomething == 2)
            isSomething = 2;
        else if (isSomething == 1)
        {
            s[count++] = ' ';
            s[count++] = s[i];
            isSomething = 0;
        }
        else if (isSomething == 2)
        {
            s[count++] = ',';
            s[count++] = s[i];
            isSomething = 0;
        }
        else
            s[count++] = s[i];

        ++i;
    }
    s[count] = '\0'; /* adding NULL-terminate to the string */
}

AMDG · Answer 6 · 2022-05-09T22:39:17.760

Here is one possible algorithm. It is not necessarily well-optimized as presented here, but exists to demonstrate one possible implementation of an algorithm. It is intentionally partially abstract.

The following is a very robust O(n) time algorithm you may use to trim whitespace (among other things if you generalize and extend it).

This implementation has not been verified to work as-is, however.

You should track the previous character and relevant spaces so that if you see { ',', ' ' } or { CHAR_IN_ALPHABET, ' '}, you begin a chain, and a value representing the current path of execution. When you see any other character, the chain should break if the first sequence, and vice versa if the second sequence is detected. We'll be defining a function:

// const char *const in: indicates intent to read from in only
void trim_whitespace(const char *const in, char *out, uint64_t const out_length);

We are defining a definite algorithm in which all execution paths are known, so for each unique possible state of execution, you should assign a numeric value increasing linearly beginning from zero using enums defined within the function for readability, and switch statements (unless goto and labels better models the behavior of the algorithm):

void trim_whitespace(const char *const in, char *out, uint64_t const out_length) {
    // better to use ifdefs first or avoid altogether with auto const variable,
    // but you get the point here without all that boilerplate
    #define CHAR_NULL 0

    enum {
        DEFAULT = 0,
        WHITESPACE_CHAIN
    } execution_state = DEFAULT;
    
    // track if loop is executing; makes the logic more readable;
    // can also detect environment instability
    // volatile: don't want this to be optimized out of existence
    volatile bool executing = true;

    while(executing) {
        switch(execution_state) {
        case DEFAULT:
            ...
        case WHITESPACE_CHAIN:
            ...
        default:
            ...
        }
    }

    function_exit:
        return;

    // don't forget to undefine once finished so another function can use
    // the same macro name!
    #undef CHAR_NULL
}

The number of possible execution states is equal to 2**ceil(log_2(n)) where n is the number of actual execution states relevant to the operation of the current algorithm. You should explicitly name them and make cases for them in the switch statement.

In the DEFAULT case, we're only checking for commas and "legal" characters. If the previous character was a comma or legal character, and the current character is a space, then we want to set the state to WHITESPACE_CHAIN.

In the WHITESPACE_CHAIN case, we test if the current chain can be trimmed based on whether the character we began with was a comma or legal character. If the current character can be trimmed, it is simply skipped and we go to the next iteration until we hit another comma or legal character depending on what we're looking for, then set the execution state to DEFAULT. If we determine this chain to not be trimmable, then we add all the characters we skipped and set the execution state back to DEFAULT.

The loop should look something like this:

...
// black boxing subjectives for portability, maintenance, and readability
bool is_whitespace(char);
bool is_comma(char);
// true if the character is allowed in the current context
bool is_legal_char(char);
...

volatile bool executing = true;

// previous character (only updated at loop start, line #LL)
char previous = CHAR_NULL;
// current character (only updated at loop start, line #LL)
char current = CHAR_NULL;
// writes to out if true at end of current iteration; doesn't write otherwise
bool write = false;
// COMMA: the start was a comma/delimeter
// CHAR_IN_ALPHABET: the start was a character in the current context's input alphabet
enum { COMMA=0, CHAR_IN_ALPHABET } comma_or_char = COMMA;

// current character index (only updated at loop end, line #LL)
uint64_t i = 0, j = 0;

while(executing) {
    previous = current;
    current = in[i];

    if (!current) {
        executing = false;
        break;
    }

    switch(execution_state) {
        case DEFAULT:
            if (is_comma(previous) && is_whitespace(current)) {
                execution_state = WHITESPACE_CHAIN;
                write = false;
                comma_or_char = COMMA;
            } else if (is_whitespace(current) && is_legal_char(previous)) { // whitespace check first for short circuiting
                execution_state = WHITESPACE_CHAIN;
                write = false;
                comma_or_char = CHAR_IN_ALPHABET;
            }
            
            break;

        case WHITESPACE_CHAIN:
            switch(comma_or_char) {
                case COMMA:
                    if (is_whitespace(previous) && is_legal_char(current)) {
                        execution_state = DEFAULT;
                        write = true;
                    } else if (is_whitespace(previous) && is_comma(current)) {
                        execution_state = DEFAULT;
                        write = true;
                    } else {
                        // illegal condition: logic error, unstable environment, or SEU
                        executing = true;
                        out = NULL;
                        goto function_exit;
                    }

                    break;

                case CHAR_IN_ALPHABET:
                    if (is_whitespace(previous) && is_comma(current) {
                        execution_state = DEFAULT;
                        write = true;
                    } else if (is_whitespace(previous) && is_legal_char(current)) {
                        // abort: within valid input string/token
                        execution_state = DEFAULT;
                        write = true;
                        // make sure to write all the elements we skipped; 
                        // function should update the value of j when finished
                        write_skipped(in, out, &i, &j);
                    } else {
                        // illegal condition: logic error, unstable environment, or SEU
                        executing = true;
                        out = NULL;
                        goto function_exit;
                    }

                    break;

                default:
                    // impossible condition: unstable environment or SEU
                    executing = true;
                    out = NULL;
                    goto function_exit;
            }
            
            break;

        default:
            // impossible condition: unstable environment or SEU
            executing = true;
            out = NULL;
            goto function_exit;
    }

    if (write) {
        out[j] = current;
        ++j;
    }

    ++i;
}

if (executing) {
    // memory error: unstable environment or SEU
    out = NULL;
} else {
    // execution successful
    goto function_exit;
}

// end of function

Please kindly also use the word whitespace to describe these characters as that is what they are commonly known as, not "white chars".

Here, I actually use `goto` to avoid having `return;` statements within the loop and switch statements. It's better for control flow maintenance and readability and more easily modified if it becomes necessary to do so by simply modifying content after the label. — AMDG, May 09 '22 at 22:44
I agree with your `goto` use; this is a regular language, equivalent to a finite automaton or state machine. — Neil, May 10 '22 at 03:00
@Neil not that I care if other programmers seethe about it, hehehe. `goto` is a tool like any other. For complex loop behavior, goto with labels is _far more readable_ and lightweight compared to a bunch of ifs, switches, and inline functions. `while` and `for` have their place with strict, one-dimensional, iterative behaviors, but `goto` is best for anything beyond just looping over an array and applying a single non-branching transform to its elements. — AMDG, May 10 '22 at 23:49

Remove white chars between commas, but not between what inside the commas

6 Answers6

Linked