Scan and swap string values with regex in ANSI C

Question

I want to transform a given input in my C program, for example:

foo_bar_something-like_this

into this:

thissomethingbarfoolike

Explanation:

Every time I get a _, the following text up to, but not including, the next _ or - (or the end of the line) needs to go to the beginning (and the preceding _ needs to be removed). Every time I get a -, the following text up to, but not including, the next _ or - (or the end of the line) needs to be appended to the end (with the - removed).

If possible, I would like to use regular expressions in order to achieve this. If there is a way to do this directly from stdin, it would be optimal.

Note that it is not necessary to do it in a single regular expression. I can do some kind of loop to do this. In this case I believe I would have to capture the data in a variable first and then do my algorithm.

I have to do this operation for every line in my input, each of which ends with \n.

EDIT: I had already written code for this that does not use regexes, and I should have posted it in the first place; my apologies. I know scanf is discouraged because of the risk of buffer overflow, but the strings are already validated before being used in the program. The code is the following:

#include <stdio.h>
#include <stdlib.h>
#define MAX_LENGTH 100001 //A fixed maximum amount of characters per line
int main(){
  char c=0;
  /*
  *home: 1 (append to the start), 0 (append to the end)
  *str: array of words appended to the beginning
  *strlen: length of str
  *line: string of words appended to the end
  *linelen: length of line
  *word: word between a combination of symbols - and _
  *wordlen: length of the actual word
  */
  int home,strlen,linelen,wordlen;
  char **str,*line,*word;
  str=(char**)malloc(MAX_LENGTH*sizeof(char*));
  while(c!=EOF && scanf("%c",&c)!=EOF){
    line=(char*)malloc(MAX_LENGTH);
    word=(char*)malloc(MAX_LENGTH);
    line[0]=word[0]='\0';
    home=strlen=linelen=wordlen=0;
    while(c!='\n'){
      if(c=='_'){ //store the pending word in str; the text after '_' goes to the start
        home=1;
        str[strlen++]=word;
        word=(char*)malloc(MAX_LENGTH);
        wordlen=0;
        word[0]='\0';
      }else if(c=='-'){ //store the pending word in str; the text after '-' goes to the end
        home=0;
        str[strlen++]=word;
        word=(char*)malloc(MAX_LENGTH);
        wordlen=0;
        word[0]='\0';
      }else if(home){ //append the c to word
        word[wordlen++]=c;
        word[wordlen]='\0';
      }else{ //append c to line
        line[linelen++]=c;
        line[linelen]='\0';
      }
      scanf("%c",&c); //scan the next character
    }
    printf("%s",word); //print the last word
    free(word);
    while(strlen--){ //print each word stored in the array
      printf("%s",str[strlen]);
      free(str[strlen]);
    }
    printf("%s\n",line); //print the text appended to the end
    free(line);
  }
  return 0;
}

There are no regular expressions defined by Standard (ANSI) C — the standard C library does not include a regular expression package. You need to decide how you're going to work around it. There are a myriad regular expression packages available — [PCRE](https://pcre.org/) is one such, POSIX provides [`regcomp()`](http://pubs.opengroup.org/onlinepubs/9699919799/functions/regcomp.html) et al, and there are others too, of greater or lesser power. — Jonathan Leffler, Nov 18 '17 at 22:19
Your explanation of what you need to do needs clarifying. Is it the material before the `_` or after it that needs to go to the beginning? It would probably help to see some intermediate results. When you scan the second `_`, what should the output buffer look like? It isn't clear that this is something that can or should be handled by regular expressions. It is more complex than regular expressions normally handle. — Jonathan Leffler, Nov 18 '17 at 22:22
@JonathanLeffler Actually, this is something that can *easily* be done using simple regexes. See my answer. — robinCTS, Nov 19 '17 at 03:22
@JonathanLeffler Also, the explanation is ***very*** clear to me, having looked at the example. However, I have reworded the explanation for the benefit of those having trouble doing so. Actually, as it turns out, it doesn't *matter* whether it is the material before the `_` or after it that goes to the beginning. **The result is the same either way.** The only thing that *really* makes a difference is whether the material before or after the `-` goes to the end. As it turns out, the example clearly shows that it's the material after that needs to be moved. — robinCTS, Nov 19 '17 at 09:22
@JonathanLeffler To really clarify the question the OP should have posted a better example. `foo_bar_something-like-this_or-that_` with the corresponding output `orsomethingbarfoolikethisthat` would cover all the other possible unspecified cases. — robinCTS, Nov 19 '17 at 09:22
I just added the code I had, I don't know if this is relevant for what I asked, but it appears that it will actually not be possible to achieve using regex with ANSI. — Franco Eduardo Ramírez Reyes, Nov 20 '17 at 00:08
@Franco Not sure if your comment was directed at anybody in particular or just in general, but if you want another user to get notified that you have added a comment, you need to "ping" them by using an @ like I did in this comment. (Note that the author of a post always gets pinged.) See [this FAQ page](//meta.stackexchange.com/questions/43019) for further details. As to the added code, yes it is relevant. It shows you have made an effort and are *really* asking if it's possible to use regexes as an alternative, instead of just possibly "fishing" for a complete solution. — robinCTS, Nov 20 '17 at 01:36

score 1 · Answer 1 · answered Nov 19 '17 at 03:11

I do not think regex can do what you are asking for, so I wrote a simple state machine solution in C.

//
//Description: This program takes a string of character input and parses it
//using underscore and hyphen as cues to either send data to
//the beginning or end of the output.
//
//Date: 11/18/2017
//
//Author: Elizabeth Harasymiw
//

#include <stdio.h>
#include <string.h>
#define MAX_SIZE 100

typedef enum{ AppendEnd, AppendBegin } State; //Used to track whether we are writing to the beginning or the end of the output

int main(int argc,char**argv){
        int ch;                    //Holds the current character; int so that EOF can be told apart from valid characters
        State state=AppendEnd;     //creates the State
        char Buffer[MAX_SIZE]={0}; //Current Output
        char Word[MAX_SIZE]={0};   //Pending data to the Buffer
        char *c;                   //Used to index and clear Word
        while((ch = getc(stdin)) != EOF){
                if(ch=='\n')continue;
                switch(state){
                        case AppendEnd:
                                if( ch == '-' )
                                        break;
                                if( ch == '_'){
                                        state = AppendBegin;     //Change State
                                        strcat(Buffer, Word);    //Add Word to end of Output
                                        for(c=Word;*c;c++)*c=0;  //Clear Word
                                        break;
                                }
                                {
                                        int postion = -1;
                                        while(Word[++postion]);  //Find end of Word
                                        Word[postion] = ch;      //Add Character to end of Word
                                }
                                break;
                        case AppendBegin:
                                if( ch == '-' ){
                                        state = AppendEnd;       //Change State
                                        strcat(Word, Buffer);    //Add Output to end of Word
                                        strcpy(Buffer, Word);    //Move Output from Word back to Output
                                        for(c=Word;*c;c++)*c=0;  //Clear Word
                                        break;
                                }
                                if( ch == '_'){
                                        strcat(Word, Buffer);    //Add Output to end of Word
                                        strcpy(Buffer, Word);    //Move Output from Word back to Output
                                        for(c=Word;*c;c++)*c=0;  //Clear Word
                                        break;
                                }
                                {
                                        int postion = -1;
                                        while(Word[++postion]);  //Find end of Word
                                        Word[postion] = ch;      //Add Character to end of Word
                                }
                                break;

                }
        }
        switch(state){ //Finish adding the Last Word Buffer to Output
                case AppendEnd:
                        strcat(Buffer, Word); //Add Word to end of Output
                        break;
                case AppendBegin:
                        strcat(Word, Buffer); //Add Output to end of Word
                        strcpy(Buffer, Word); //Move Output from Word back to Output
                        break;
        }

        printf("%s\n", Buffer);
}
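
The program above treats all of stdin as a single string, because it skips every '\n'. Below is a minimal sketch of a per-line variant, assuming each line is shorter than a hypothetical MAX_LINE limit. The names rearrange_line and process_stream are illustrative only and are not part of the code above. It keeps the same idea (a flag that says whether the current word goes to the front or to the back of the output) but commits each word as soon as its terminating separator is seen.

```c
#include <stdio.h>
#include <string.h>

#define MAX_LINE 1024  /* hypothetical per-line limit for this sketch */

/* Rearranges one line (without its '\n') from in into out.
   out must not overlap in and needs room for strlen(in) + 1 bytes. */
void rearrange_line(const char *in, char *out)
{
    size_t out_len = 0;     /* characters committed to out so far */
    size_t word_start = 0;  /* index in in where the current word starts */
    int to_front = 0;       /* 1 after '_', 0 after '-' or at line start */
    size_t i;

    for (i = 0; ; i++) {
        if (in[i] == '_' || in[i] == '-' || in[i] == '\0') {
            size_t word_len = i - word_start;
            if (to_front) {
                /* shift the existing output right, then copy the word in front */
                memmove(out + word_len, out, out_len);
                memcpy(out, in + word_start, word_len);
            } else {
                memcpy(out + out_len, in + word_start, word_len);
            }
            out_len += word_len;
            if (in[i] == '\0')
                break;
            to_front = (in[i] == '_');
            word_start = i + 1;
        }
    }
    out[out_len] = '\0';
}

/* Reads lines from in, rearranges each one and writes it to out. */
void process_stream(FILE *in, FILE *out)
{
    char line[MAX_LINE], result[MAX_LINE];

    while (fgets(line, sizeof line, in) != NULL) {
        line[strcspn(line, "\n")] = '\0';  /* drop the trailing newline, if any */
        rearrange_line(line, result);
        fprintf(out, "%s\n", result);
    }
}
```

Calling process_stream(stdin, stdout) from main handles the input one line at a time. Lines longer than MAX_LINE - 1 characters would be split by fgets and treated as separate lines, so the limit has to be chosen to fit the input.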

Actually, you *can* use regexes to solve the problem. See my answer. — robinCTS, Nov 19 '17 at 03:19

score 1 · Answer 2 · edited Jun 20 '20 at 09:12

This can be done with regexes using loops, assuming you aren't strictly restricted to ANSI. The following uses PCRE.

(Note that this answer deliberately does not show the C code. It is only meant to guide the OP by showing a possible technique for using regexes, as it is not obvious how to do so.)

Method A

Uses two different regexes.

Part 1/2 (Demo)

Regex: ([^_\n]*)_([^_\n]*)(_.*)? Substitution: $2--$1$3

This moves the text following the next underscore to the beginning, appending -- to it. It also removes the underscore. You need to repeat this substitution in a loop until no more matches are found.

For your example, this leads to the following string:

this--something-like--bar--foo

Part 2/2 (Demo):

Regex: (.*)(?<!-)-(?!-)(\w+)(.*) Substitution: $1$3--$2

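This moves the text following a single hyphen to the end, prepending -- to it. Because the leading (.*) is greedy, the hyphen picked on each pass is the last remaining single hyphen in the string. It also removes the hyphen. You need to repeat this substitution in a loop until no more matches are found.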

For your example, this leads to the following string:

this--something--bar--foo--like

Remove the hyphens from the string to get your result.

Note that the first regex can be simplified to the following and will still work:

([^_]*)_([^_]*)(_.*)?

The \ns were only required to show the intermediate loop results in the demos.

The following are the reasons for using -- as a new separator:

A separator is required so that the regex in part 2 can find the correct end of hyphen prefixed text;
An underscore can't be used as it would interfere with the regex in part 1, causing an infinite loop;
A hyphen can't be used as it would cause the regex in part 2 to find extraneous text;
Although any single character delimiter which can never exist in the input would work and lead to a simpler part 2 regexp, -- is one of the delimiters which allows any and every character^* in the input.
\n is actually the perfect ^* delimiter, but can't be used in this answer as it would not allow the demo to show the intermediate results. (Hint: it should be the actual delimiter used by you.)

Method B

Combines the two regexes.

(Demo)

Regex: ([^_\n]*)_([^_\n]*)(_.*)?|(.*)(?<!-)-(?!-)(\w+)(.*) Substitution: $2--$1$3$4$6--$5

For your example, this leads to the following string:

----this------something--bar--foo----like

As before, remove all the hyphens from the string to get your result.

Also as before, the regex can be simplified to the following and will still work:

([^_]*)_([^_]*)(_.*)?|(.*)(?<!-)-(?!-)(\w+)(.*)

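This combined regex works because capturing groups 1, 2 and 3 are mutually exclusive with groups 4, 5 and 6: on any given match only one side of the alternation captures anything, and the other side's groups expand to empty strings. There is a side effect of extra hyphens, however.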

Caveat:

^* Using -- as a delimiter fails if the input contains consecutive hyphens. All the other "good" delimiters have a similar failure edge case. Only \n is guaranteed not to exist in the input and thus is failsafe.


answered Nov 19 '17 at 03:16

robinCTS


Using loops etc with regular expressions is one thing — and yes, you can do it piecemeal like that. Using just regular expressions is another thing, and no, I don't think you can do it without constructs such as loops from outside the regular expressions. You've also not explained which regex package you're using (PCRE?), nor shown the C code that this question is tagged with. – Jonathan Leffler Nov 19 '17 at 03:56
Your regex demo seems to transform the question's `foo_bar_something-like_this` into `foo_bar_something--like_this`; it is not clear that this is similar to `thissomethingbarfoolike` which is the desired output. Case not yet proven. – Jonathan Leffler Nov 19 '17 at 04:02
@JonathanLeffler 1a) The OP specifically states that regexes with looping are allowable. 1b) At no point did I claim it *could* be done without looping. My comment "can easily be done using simple regexes" was replying to your comment - "It isn't clear that this is something that can [...] be handled by regular expressions." 1c) In reply to "It is more complex than regular expressions normally handle", while it *is* typical to parse the results of a global set of matches of a regex execution via a loop, it is also not uncommon to loop non-globally until the search is exhausted. – robinCTS Nov 19 '17 at 05:50
2) Thank you. I had already realised that I had left out the regex package I had used and was in the middle of an edit to add that, as well as other things. 3) **StackOverflow is *not* a code writing or a "do your homework for you" site**. Technically speaking, this question should be closed as "too broad" or "off topic". It actually has two close votes on it already to that effect. Despite that I thought I'd help out with the regex part. I *was* planning on only posting a comment, but it ended up that my reply would have been a little too extensive for that. – robinCTS Nov 19 '17 at 05:50
4) I suggest you look at the demos more carefully. Each line at the top represents the progressive input to be fed into the loop. Each line at the bottom represents the progressive output of the loop. I have no idea where you get `foo_bar_something--like_this` as an output. Note that I have now added the `-me` to demonstrate the correct ordering of the postfixed text, but this doesn't really change your claim. The output of the first step of part 1 is `bar--foo_something-like_this-me` and the output of the final step is `this-me--something-like--bar--foo`. – robinCTS Nov 19 '17 at 05:50
5) **Case actually proven!** – robinCTS Nov 19 '17 at 05:51
This solution is actually really good, and I would use it in my algorithm if I wasn't forced to use ANSI. It seems way more condensed than my actual code. Thank you very much for your reply. It's not my homework, by the way. A friend gave me this problem and I solved it, but I believe it could be way more efficient if I used or could be able to use regex than what I did. – Franco Eduardo Ramírez Reyes Nov 20 '17 at 01:09
@FrancoEduardoRamírezReyes Thanks. I wasn't implying that your question was homework. Homework questions *are* actually allowed, provided one makes an effort and posts some code with some specific problem. Any too broad question, like the one you originally posted risks getting down voted/closed. Your edit makes it a lot better. – robinCTS Nov 20 '17 at 01:25
