C/C++ use a switch case match all alphabet

Question

I have a long if () ... else if () ... else if() ... chain, similar to:

int token;
if((token >= 'a' && token <= 'z') || (token >= 'A' && token <= 'Z'))
    // ...
else if (token == '\n')
    // ...
else if (token == '^')
    // ...
else if (token == '&')
    // ...

It has a lot of '==' comparisons and a few range checks like token >= 'a' && token <= 'z', so I want to rewrite this if/else chain as a switch. But using case labels to match the whole alphabet is cumbersome. I know it can be written as the following code:

int token;
switch (token) {
    case 'a':
    case 'b':
    case 'c':
    case 'd':
    case 'e':
    case 'f':
    case 'g':
    case 'h':
    case 'i':
    case 'j':
    case 'k':
    case 'l':
    case 'm':
    case 'n':
    case 'o':
    case 'p':
    case 'q':
    case 'r':
    case 's':
    case 't':
    case 'u':
    case 'v':
    case 'w':
    case 'x':
    case 'y':
    case 'z':
        // ...
        break;

}

But I think this is not concise, so I want to ask whether there is a more concise way to use a case to match a-z and A-Z.

A switch is really the wrong tool for doing ranges ... Also, there's `std::isalpha`, — ChrisMM, Feb 25 '23 at 18:18
GCC has it as its extension. [Case Ranges (Using the GNU Compiler Collection (GCC))](https://gcc.gnu.org/onlinedocs/gcc/Case-Ranges.html) — MikeCAT, Feb 25 '23 at 18:18
Instead of `(token >= 'a' && token <= 'z') || (token >= 'A' && token <= 'Z')`, why not `isalpha(token)`? — Some programmer dude, Feb 25 '23 at 18:19
What do you want to do in each `case`? If we know this we might be able to suggest a better approach. — Richard Critten, Feb 25 '23 at 18:19
And guessing by the variable `token`, you're doing some kind of parser or lexical recognizer? Then why not add a simple `if` for recognizing "names", and else use a `switch` for other special symbols? That's usually what I do. — Some programmer dude, Feb 25 '23 at 18:21
You can use a 256 byte lookup table (or 128 byte for 7-bit ASCII) to map each character to an action, and then `switch` on the action. All of the upper and lower case letters would have the same action. Operators like `+`, `-`, `*`, `/` and `%` might also have the same action. — user3386109, Feb 25 '23 at 18:40
@Ne C First of all describe how you want to classify characters. — Vlad from Moscow, Feb 25 '23 at 18:44
A weakness/strength to `isalpha()` is that it is _locale_ sensitive. Depends on coding goals if it is _better_ or not. — chux - Reinstate Monica, Feb 25 '23 at 18:53
A side note on `token >= 'a' && token <= 'z'`: It's very unusual these days, but there's no guarantee that the letters are stored in ascending order. Or any sane order, for that matter. Best to use the collection of `is...` library functions to be absolutely certain there are no boobytraps coming your way. — user4581301, Feb 25 '23 at 19:38
As you can imagine, this is part of an interpreter, I decided to use if with switch instead of just if or switch. @Someprogrammerdude — Ne C, Feb 26 '23 at 08:59

score 6 · Accepted Answer · answered Feb 25 '23 at 18:20

Use case in switch for dedicated chars and default: label for ranges:

switch (token) {
case '\n':
  // ...
  break;
case '&':
  // ...
  break;
case '^':
  // ...
  break;
default:
  if (std::isalpha(token)) {
    // ...
  }
  break;
}

Or, a bit unusual:

if (std::isalpha(token)) {
  // ...
} else switch (token) {
case '\n':
  // ...
  break;
case '&':
  // ...
  break;
case '^':
  // ...
  break;
}
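For readers working in plain C, the first variant above can be sketched as a complete function. This is only an illustration: the `token_kind` enum and the `classify` name are hypothetical, and the explicit range check is an added precaution because `isalpha()` is only defined for `EOF` and values representable as `unsigned char`.

```c
#include <ctype.h>
#include <limits.h>

/* Hypothetical token classes, introduced only for this illustration. */
enum token_kind { TK_OTHER, TK_LETTER, TK_NEWLINE, TK_AMP, TK_CARET };

/* Single characters get their own case label;
   the letter range is handled under default. */
enum token_kind classify(int token)
{
    switch (token) {
    case '\n':
        return TK_NEWLINE;
    case '&':
        return TK_AMP;
    case '^':
        return TK_CARET;
    default:
        /* Reject values outside 0..UCHAR_MAX before calling isalpha(). */
        if (token >= 0 && token <= UCHAR_MAX && isalpha(token))
            return TK_LETTER;
        return TK_OTHER;
    }
}
```

A caller would switch on the returned kind, or place the per-class code directly in each branch as the answer does.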

Nice. I like the switch for out of range characters. – Thomas Matthews Feb 25 '23 at 19:22

chux - Reinstate Monica · Answer 2 · 2023-02-25T23:49:11.657

6

Use a table to simplify the switch(), as suggested by @user3386109.

  // Table look-up of character to switch index (needs <limits.h> for UCHAR_MAX).
  static const unsigned char type[UCHAR_MAX + 1u] = { //
      ['A'] = 1, ['B'] = 1, ['C'] = 1, ['D'] = 1,['E'] = 1, //
      ['F'] = 1, ['G'] = 1, ['H'] = 1, ['I'] = 1,['J'] = 1, //
      ['K'] = 1, ['L'] = 1, ['M'] = 1, ['N'] = 1,['O'] = 1, //
      ['P'] = 1, ['Q'] = 1, ['R'] = 1, ['S'] = 1,['T'] = 1, //
      ['U'] = 1, ['V'] = 1, ['W'] = 1, ['X'] = 1,['Y'] = 1, //
      ['Z'] = 1, //
      ['a'] = 1, ['b'] = 1, ['c'] = 1, ['d'] = 1,['e'] = 1, //
      ['f'] = 1, ['g'] = 1, ['h'] = 1, ['i'] = 1,['j'] = 1, //
      ['k'] = 1, ['l'] = 1, ['m'] = 1, ['n'] = 1,['o'] = 1, //
      ['p'] = 1, ['q'] = 1, ['r'] = 1, ['s'] = 1,['t'] = 1, //
      ['u'] = 1, ['v'] = 1, ['w'] = 1, ['x'] = 1,['y'] = 1, //
      ['z'] = 1, //
      ['\n'] = 2, //
      ['^'] = 3, //
      ['&'] = 4, //
      // Other elements are 0 since they are not explicitly initialized.
      };

  unsigned char token;
  switch (type[token]) {
    case 1: /* ... */ break; // letters
    case 2: /* ... */ break; // \n
    case 3: /* ... */ break; // ^
    case 4: /* ... */ break; // &
    default: break;          // None of the above.
  }

This somewhat replicates the is...() routines, but 1) does not have locale variance^*1 and 2) can be customized to your parsing needs.

Better to use named constants/enum than 1,2,3,4...
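A minimal sketch of the named-constant variant, assuming the same four classes as above. The enum names, `init_char_class_table`, `class_of` and `describe` are all hypothetical. The table is filled at run time from a string of letters, which is one way to avoid both the long initializer and any assumption that letters are contiguous.

```c
#include <limits.h>

/* Hypothetical class names standing in for the bare 1, 2, 3, 4. */
enum char_class { CC_OTHER = 0, CC_LETTER, CC_NEWLINE, CC_CARET, CC_AMP };

static unsigned char char_class_table[UCHAR_MAX + 1u];

/* Fill the table once at startup. Listing the letters in a string avoids
   assuming that they are contiguous in the execution character set. */
void init_char_class_table(void)
{
    static const char letters[] =
        "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz";
    for (const char *p = letters; *p != '\0'; ++p)
        char_class_table[(unsigned char)*p] = CC_LETTER;
    char_class_table[(unsigned char)'\n'] = CC_NEWLINE;
    char_class_table[(unsigned char)'^'] = CC_CARET;
    char_class_table[(unsigned char)'&'] = CC_AMP;
    /* All other entries stay CC_OTHER (0) because the array is static. */
}

enum char_class class_of(unsigned char token)
{
    return (enum char_class)char_class_table[token];
}

/* Returns a short description; real code would run per-class logic here. */
const char *describe(unsigned char token)
{
    switch (class_of(token)) {
    case CC_LETTER:  return "letter";
    case CC_NEWLINE: return "newline";
    case CC_CARET:   return "caret";
    case CC_AMP:     return "ampersand";
    case CC_OTHER:   break;
    }
    return "other";
}
```

`init_char_class_table()` has to run once before the first lookup; until then every character reports `CC_OTHER`.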

Speed

I suspect OP is using this code for tokenizing, which belongs to the 3% of cases where micro-optimizations are worth it.

^*1 Advanced: Many is...() functions have some locale dependence. Example: "In the "C" locale, isalpha returns true only for the characters for which isupper or islower is true." C17dr § 7.4.1.2 2
This mostly affects non-ASCII characters (outside the 0-127 range). When tokenizing for a specific protocol, keep in mind that the is...() functions have locale variations.
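When a protocol or grammar defines "letter" as exactly A-Z and a-z, a small locale-independent predicate is one alternative to `isalpha()`. The sketch below is an assumption-laden illustration: `is_basic_letter` is a hypothetical name, and it trades speed for brevity by searching a string instead of indexing a table.

```c
#include <string.h>

/* Hypothetical helper: nonzero only for the 52 basic Latin letters,
   whatever the current locale is. */
int is_basic_letter(int c)
{
    static const char letters[] =
        "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz";
    /* strchr() would also match the terminating '\0', so exclude 0.
       The second test rejects values outside 0..UCHAR_MAX such as EOF. */
    return c != 0 && c == (unsigned char)c && strchr(letters, c) != NULL;
}
```

Because it never consults the locale, the result is the same before and after any `setlocale()` call.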

edited Feb 25 '23 at 23:49

answered Feb 25 '23 at 19:03

An alternative to an `enum` is to map each class to a canonical representative: `['A'] = 'A', ['B'] = 'A', ['C'] = 'A',… ['Z'] = 'A',… ['\n'] = '\n',…` and `case 'A' : … case '\n': …`. – Eric Postpischil Feb 25 '23 at 19:07
@EricPostpischil Yes, that approach lends itself to added clarity. I suspect it makes for a more complex `switch()` code. Later on, OP could use the 0, 1, 2, 3 , ... approach for a function table look-up. – chux - Reinstate Monica Feb 25 '23 at 19:12
Please, indicate that this is a fun answer and that people should not actually do that. Not everybody might get that... – NeitherNor Feb 25 '23 at 19:19
@NeitherNor: Why do you think people should not actually do that? Mapping to a class and switching on the class is a reasonable solution. – Eric Postpischil Feb 25 '23 at 19:26
@NeitherNor Depending on coding goal, this is a very good way to do it. It is fast and highly portable. For learner exercises, yes, `isalpha()` is OK, but that has _locale_ issues. What is best really depends on many other criteria not posted by OP. – chux - Reinstate Monica Feb 25 '23 at 19:27
@NeitherNor And what is wrong about it? I find it brilliant, to say the least. – Harith Feb 25 '23 at 19:28
@NeitherNor This is a very well written and serious answer. Please elaborate on your comment, or admit you were trolling us. – Costantino Grana Feb 25 '23 at 19:30
Code like this is hard to read and maintain. We might get talking if this is really about a rare case of micro-optimization (this was edited in after my answer), but nothing like this was indicated in the original question. So, the original version of the answer, without any "micro-opt" remark, was confusing at best. – NeitherNor Feb 25 '23 at 19:53
@NeitherNor: Re “Code like this is hard to read and maintain”: No, it is not. This code is straightforward and easily understood and maintained. At most it needs a comment introducing the fact it maps to a class and dispatches based on the class. – Eric Postpischil Feb 25 '23 at 20:09
@EricPostpischil: If you consider scrolling through 256-element lookup table littered source code easy to read and maintain, we clearly have different standards. The whole thing might be acceptable in highly optimized code, _after_ it was shown to be a bottleneck and with some data actually proving that this is faster. There, one might also elaborate on why one cannot assume that A-Z are in sequence but, at the same time, assume that \n is used for a line break... – NeitherNor Feb 25 '23 at 21:26
@NeitherNor: What is “scrolling through 256-element lookup table”? You do not have to “scroll through” the table to understand its function. As I mentioned, a simple comment will introduce this code, and any moderately educated software engineer will understand it easily. And `\n` is defined to be a new-line character by the C standard. I would grade this code as close to trivial. – Eric Postpischil Feb 25 '23 at 22:00
I liked your table driven approach. Since I've used similar approaches in the past, it was perfectly clear to me. So, I don't know why you got all the brouhaha over it. In my answer, I used your code as the starting point to show some code I used in a real situation that is, perhaps, even more obscure to a less advanced programmer. – Craig Estey Feb 26 '23 at 01:09
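The canonical-representative idea from the comments above could be sketched as follows. This is only an illustration: the `LETTER` macro, the `canon` table and the `handle` function with its numeric return codes are hypothetical names chosen for the example.

```c
#include <limits.h>

/* Every letter maps to the representative 'A'. */
#define LETTER(c) [c] = 'A'

static const unsigned char canon[UCHAR_MAX + 1u] = {
    LETTER('A'), LETTER('B'), LETTER('C'), LETTER('D'), LETTER('E'), LETTER('F'),
    LETTER('G'), LETTER('H'), LETTER('I'), LETTER('J'), LETTER('K'), LETTER('L'),
    LETTER('M'), LETTER('N'), LETTER('O'), LETTER('P'), LETTER('Q'), LETTER('R'),
    LETTER('S'), LETTER('T'), LETTER('U'), LETTER('V'), LETTER('W'), LETTER('X'),
    LETTER('Y'), LETTER('Z'),
    LETTER('a'), LETTER('b'), LETTER('c'), LETTER('d'), LETTER('e'), LETTER('f'),
    LETTER('g'), LETTER('h'), LETTER('i'), LETTER('j'), LETTER('k'), LETTER('l'),
    LETTER('m'), LETTER('n'), LETTER('o'), LETTER('p'), LETTER('q'), LETTER('r'),
    LETTER('s'), LETTER('t'), LETTER('u'), LETTER('v'), LETTER('w'), LETTER('x'),
    LETTER('y'), LETTER('z'),
    ['\n'] = '\n', ['^'] = '^', ['&'] = '&',
    /* Unlisted entries are 0. */
};

/* The case labels are the representatives themselves, so no enum is needed.
   The return codes are placeholders for real per-class code. */
int handle(unsigned char token)
{
    switch (canon[token]) {
    case 'A':  return 1;   /* any letter */
    case '\n': return 2;
    case '^':  return 3;
    case '&':  return 4;
    default:   return 0;   /* none of the above */
    }
}
```

The trade-off discussed in the comments still applies: the labels read naturally, but the class values are no longer small consecutive integers, so they cannot index a function table directly.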

Harith · Answer 3 · 2023-02-25T19:35:28.423

2

I want to ask if there is a more concise way to use a case to match a-z and A-Z.

Yes, there is.

int token;
if((token >= 'a' && token <= 'z') || (token >= 'A' && token <= 'Z'))
    // ...

can be simplified to just:

#include <ctype.h>

int token;
if(isalpha((unsigned char)token))
    // ...

isalpha() checks for an alphabetic character. The value returned is nonzero if the character c falls into the tested class, and zero if not.
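The cast to `unsigned char` matters when the value comes from a plain `char`, which may be signed: passing a negative value other than `EOF` to `isalpha()` is undefined behavior. A small sketch, where `count_alpha` is a hypothetical helper written for this illustration:

```c
#include <ctype.h>
#include <stddef.h>

/* Counts the alphabetic characters in a NUL-terminated string. */
size_t count_alpha(const char *s)
{
    size_t n = 0;
    for (; *s != '\0'; ++s) {
        /* Plain char may be signed; convert before calling isalpha(). */
        if (isalpha((unsigned char)*s))
            ++n;
    }
    return n;
}
```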

For the rest of the symbols, use either a switch statement (preferably) or an if/else ladder. The functions declared in ctype.h could help simplify things further.

As commented by @MikeCAT, GCC provides a useful extension: [Case Ranges (Using the GNU Compiler Collection (GCC))](https://gcc.gnu.org/onlinedocs/gcc/Case-Ranges.html).

Also see @Toby's answer here to see the discrepancy between the two solutions.

edited Feb 25 '23 at 19:35

answered Feb 25 '23 at 18:40

The two are functionally identical _if_ we can assume a character set where a-z are represented by contiguous integers. Fortunately, using `isalpha` (or `std::isalpha` if this is C++) does not require us to make this assumption. – Chris Feb 25 '23 at 18:52
@Chris, [Almost](https://stackoverflow.com/questions/75567462/c-c-use-a-switch-case-match-all-alphabet#comment133322225_75567462) identical. – chux - Reinstate Monica Feb 25 '23 at 18:55

0___________ · Answer 4 · 2023-02-25T19:19:35.817

2

I want to ask if there is a more concise way to use a case to match a-z and A-Z

GCC has an extension:

void foo(char x)
{
    switch(x)
    {
        case 'a' ... 'z':
            printf("Lower case letter\n");
            break;
        case 'A' ... 'Z':
            printf("Upper case letter\n");
            break;
        case '0' ... '9':
            printf("Digit\n");
            break;
        default:
            printf("Something else\n");
            break;
    }
}

But of course, it will not compile using non-GCC family compilers.
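If the same source has to build with a compiler that lacks the extension, one option is to guard the case ranges with a preprocessor check and fall back to the `<ctype.h>` functions. The sketch below assumes that a defined `__GNUC__` macro implies case-range support, which holds for GCC and for Clang; `kind_of` and its one-letter return codes are hypothetical.

```c
#include <ctype.h>
#include <limits.h>

/* Returns 'l' for lowercase, 'u' for uppercase, 'd' for digit, '?' otherwise. */
char kind_of(int c)
{
#if defined(__GNUC__)
    /* Case ranges assume contiguous letters, which holds for ASCII. */
    switch (c) {
    case 'a' ... 'z': return 'l';
    case 'A' ... 'Z': return 'u';
    case '0' ... '9': return 'd';
    default:          return '?';
    }
#else
    /* Portable fallback for compilers without the extension. */
    if (c < 0 || c > UCHAR_MAX)
        return '?';
    if (islower(c)) return 'l';
    if (isupper(c)) return 'u';
    if (isdigit(c)) return 'd';
    return '?';
#endif
}
```

The two branches can differ for non-ASCII characters, since `islower()` and `isupper()` depend on the locale while the ranges do not.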

edited Feb 25 '23 at 19:19

answered Feb 25 '23 at 19:12

Given all the crap they added in C2x, I wonder why they did not standardize this goodie? – chqrlie Feb 26 '23 at 01:19
@chqrlie I took 40y to spot that computers are using the binary system. So maybe in C55 they will add it. They are like God. He also did not consult 10 Commandments. – 0___________ Feb 26 '23 at 08:37
I tried this extension, it's a nice feature, but I'm trying to develop this on msvc, hope it gets added to the standard sooner – Ne C Feb 26 '23 at 08:45
@NeC `abandon all hope, ye who enter here` – 0___________ Feb 26 '23 at 08:47
@0___________: my favorite comment: https://github.com/bobbae/gosling-emacs/blob/master/display.c – chqrlie Feb 26 '23 at 10:14

score 1 · Answer 5 · answered Feb 26 '23 at 01:03

I like chux's approach of using a lookup table. In the source below, this is swchux.

It generates:

movzbl %dil,%edi
movzbl 0(%rdi),%eax

And then 4-5 pairs of:

cmp $x,%al
jxx ...

This is really fast with a limited number of case statements.

But, with a larger number of case statements, the cmp/jxx entries take up a significant amount of time.

I had a situation where there was a switch/case block with a hundred or so entries. So, this didn't scale.

By using a computed goto (using &&label), we can reduce this to (in swfix1):

movzbl %dil,%edi
movzbl 0(%rdi),%eax
jmp *tbl(,%rax,8)

For the use case I had, using the computed goto instead of the switch improved overall performance by 30%.

With some cpp macros, we can make the syntax similar to a switch/case block.

In the above examples, we're using an unsigned char lookup. If we use a direct label table, we can reduce this by one instruction (in swfix2):

movzbl %dil,%edi
jmpq   *0x0(,%rdi,8)

This eliminates one asm instruction at the expense of the lookup table using 8 bytes / entry (vs. 1 byte for the above).
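The size trade-off can be worked out with `sizeof`. The two helper functions below are hypothetical and only do the arithmetic for the two table layouts described above.

```c
#include <limits.h>
#include <stddef.h>

/* Two-step scheme: a one-byte class table plus one jump target per class. */
size_t two_step_bytes(size_t n_classes)
{
    return (UCHAR_MAX + 1u) * sizeof(unsigned char)
         + n_classes * sizeof(void *);
}

/* Direct scheme: one jump target per possible character value. */
size_t direct_bytes(void)
{
    return (UCHAR_MAX + 1u) * sizeof(void *);
}
```

On an x86-64 build with 8-byte pointers and five classes this gives 296 bytes against 2048 bytes, which agrees with the `.size` directives for `type`, `swvec.1969` and `swvec.1979` in the assembly listing.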

Here is the .c source code for the above examples.

Note that here I just used the DOIT macro as a placeholder for the actual code in the case. In real code, each case would have its own/different code.

#include <limits.h>

int state;

#define DOIT(val_) \
    state = 256 + val_

static const unsigned char type[UCHAR_MAX + 1u] = { //
    ['A'] = 1,['B'] = 1,['C'] = 1,['D'] = 1,['E'] = 1,  //
    ['F'] = 1,['G'] = 1,['H'] = 1,['I'] = 1,['J'] = 1,  //
    ['K'] = 1,['L'] = 1,['M'] = 1,['N'] = 1,['O'] = 1,  //
    ['P'] = 1,['Q'] = 1,['R'] = 1,['S'] = 1,['T'] = 1,  //
    ['U'] = 1,['V'] = 1,['W'] = 1,['X'] = 1,['Y'] = 1,  //
    ['Z'] = 1,                      //
    ['a'] = 1,['b'] = 1,['c'] = 1,['d'] = 1,['e'] = 1,  //
    ['f'] = 1,['g'] = 1,['h'] = 1,['i'] = 1,['j'] = 1,  //
    ['k'] = 1,['l'] = 1,['m'] = 1,['n'] = 1,['o'] = 1,  //
    ['p'] = 1,['q'] = 1,['r'] = 1,['s'] = 1,['t'] = 1,  //
    ['u'] = 1,['v'] = 1,['w'] = 1,['x'] = 1,['y'] = 1,  //
    ['z'] = 1,                      //
    ['\n'] = 2,                     //
    ['^'] = 3,                      //
    ['&'] = 4,                      //
    // Other elements are 0 since they are not explicitly initialized.
};

void
swchux(unsigned char token)
{

    switch (type[token]) {
    case 1:
        DOIT(1);
        break;                          // letters
    case 2:
        DOIT(2);
        break;                          // \n
    case 3:
        DOIT(3);
        break;                          // ^
    case 4:
        DOIT(4);
        break;                          // &
    default:                            // None of the above.
        DOIT(0);
        break;
    }
}

#define CASE(idx_) \
    CASE_##idx_
#define V(case_) \
    &&CASE(case_)

#undef SWITCH
#define SWITCH(idx_) \
    goto *swvec[idx_]

void
swfix1(unsigned char token)
{

    static void *swvec[5] = {
        V(0),
        V(1),
        V(2),
        V(3),
        V(4),
    };

    do {
        SWITCH(type[token]);

        CASE(1):
            DOIT(1);
            break;                          // letters
        CASE(2):
            DOIT(2);
            break;                          // \n
        CASE(3):
            DOIT(3);
            break;                          // ^
        CASE(4):
            DOIT(4);
            break;                          // &
        CASE(0):
            DOIT(0);
            break;
    } while (0);
}

#undef SWITCH
#define SWITCH(idx_) \
    goto *swvec[idx_]

void
swfix2(unsigned char token)
{

    static const void *swvec[UCHAR_MAX + 1u] = {    //
        ['A'] = V(1),['B'] = V(1),['C'] = V(1),['D'] = V(1),['E'] = V(1),   //
        ['F'] = V(1),['G'] = V(1),['H'] = V(1),['I'] = V(1),['J'] = V(1),   //
        ['K'] = V(1),['L'] = V(1),['M'] = V(1),['N'] = V(1),['O'] = V(1),   //
        ['P'] = V(1),['Q'] = V(1),['R'] = V(1),['S'] = V(1),['T'] = V(1),   //
        ['U'] = V(1),['V'] = V(1),['W'] = V(1),['X'] = V(1),['Y'] = V(1),   //
        ['Z'] = V(1),                       //
        ['a'] = V(1),['b'] = V(1),['c'] = V(1),['d'] = V(1),['e'] = V(1),   //
        ['f'] = V(1),['g'] = V(1),['h'] = V(1),['i'] = V(1),['j'] = V(1),   //
        ['k'] = V(1),['l'] = V(1),['m'] = V(1),['n'] = V(1),['o'] = V(1),   //
        ['p'] = V(1),['q'] = V(1),['r'] = V(1),['s'] = V(1),['t'] = V(1),   //
        ['u'] = V(1),['v'] = V(1),['w'] = V(1),['x'] = V(1),['y'] = V(1),   //
        ['z'] = V(1),                       //
        ['\n'] = V(2),                      //
        ['^'] = V(3),                       //
        ['&'] = V(4),                       //
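        // Caution: the remaining elements are null pointers, so a token not listed
        // above would jump through NULL. Real code should map them to V(0), e.g. by
        // putting GCC's range designator [0 ... UCHAR_MAX] = V(0) first in this list.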
    };

    do {
        SWITCH(token);

        CASE(1):
            DOIT(1);
            break;                          // letters
        CASE(2):
            DOIT(2);
            break;                          // \n
        CASE(3):
            DOIT(3);
            break;                          // ^
        CASE(4):
            DOIT(4);
            break;                          // &
        CASE(0):
            DOIT(0);
            break;
    } while (0);
}

Here is the assembly generated by building the source with -S:

    .file   "all.c"
    .text
    .p2align 4,,15
    .globl  swchux
    .type   swchux, @function
swchux:
.LFB0:
    .cfi_startproc
    movzbl  %dil, %edi
    movzbl  type(%rdi), %eax
    cmpb    $2, %al
    je  .L2
    jbe .L10
    cmpb    $3, %al
    je  .L6
    cmpb    $4, %al
    jne .L5
    movl    $260, state(%rip)
    ret
    .p2align 4,,10
    .p2align 3
.L10:
    cmpb    $1, %al
    jne .L5
    movl    $257, state(%rip)
    ret
    .p2align 4,,10
    .p2align 3
.L2:
    movl    $258, state(%rip)
    ret
    .p2align 4,,10
    .p2align 3
.L5:
    movl    $256, state(%rip)
    ret
    .p2align 4,,10
    .p2align 3
.L6:
    movl    $259, state(%rip)
    ret
    .cfi_endproc
.LFE0:
    .size   swchux, .-swchux
    .p2align 4,,15
    .globl  swfix1
    .type   swfix1, @function
swfix1:
.LFB1:
    .cfi_startproc
    movzbl  %dil, %edi
    movzbl  type(%rdi), %eax
    jmp *swvec.1969(,%rax,8)
    .p2align 4,,10
    .p2align 3
.L17:
    movl    $256, state(%rip)
    ret
    .p2align 4,,10
    .p2align 3
.L16:
    movl    $260, state(%rip)
    ret
    .p2align 4,,10
    .p2align 3
.L15:
    movl    $259, state(%rip)
    ret
    .p2align 4,,10
    .p2align 3
.L14:
    movl    $258, state(%rip)
    ret
    .p2align 4,,10
    .p2align 3
.L12:
    movl    $257, state(%rip)
    ret
    .cfi_endproc
.LFE1:
    .size   swfix1, .-swfix1
    .p2align 4,,15
    .globl  swfix2
    .type   swfix2, @function
swfix2:
.LFB2:
    .cfi_startproc
    movzbl  %dil, %edi
    jmp *swvec.1979(,%rdi,8)
    .p2align 4,,10
    .p2align 3
.L23:
    movl    $260, state(%rip)
    ret
    .p2align 4,,10
    .p2align 3
.L22:
    movl    $259, state(%rip)
    ret
    .p2align 4,,10
    .p2align 3
.L21:
    movl    $258, state(%rip)
    ret
    .p2align 4,,10
    .p2align 3
.L19:
    movl    $257, state(%rip)
    ret
    .cfi_endproc
.LFE2:
    .size   swfix2, .-swfix2
    .section    .rodata
    .align 32
    .type   swvec.1979, @object
    .size   swvec.1979, 2048
swvec.1979:
    .zero   80
    .quad   .L21
    .zero   216
    .quad   .L23
    .zero   208
    .quad   .L19
    .quad   .L19
    .quad   .L19
    .quad   .L19
    .quad   .L19
    .quad   .L19
    .quad   .L19
    .quad   .L19
    .quad   .L19
    .quad   .L19
    .quad   .L19
    .quad   .L19
    .quad   .L19
    .quad   .L19
    .quad   .L19
    .quad   .L19
    .quad   .L19
    .quad   .L19
    .quad   .L19
    .quad   .L19
    .quad   .L19
    .quad   .L19
    .quad   .L19
    .quad   .L19
    .quad   .L19
    .quad   .L19
    .zero   24
    .quad   .L22
    .zero   16
    .quad   .L19
    .quad   .L19
    .quad   .L19
    .quad   .L19
    .quad   .L19
    .quad   .L19
    .quad   .L19
    .quad   .L19
    .quad   .L19
    .quad   .L19
    .quad   .L19
    .quad   .L19
    .quad   .L19
    .quad   .L19
    .quad   .L19
    .quad   .L19
    .quad   .L19
    .quad   .L19
    .quad   .L19
    .quad   .L19
    .quad   .L19
    .quad   .L19
    .quad   .L19
    .quad   .L19
    .quad   .L19
    .quad   .L19
    .zero   1064
    .align 32
    .type   swvec.1969, @object
    .size   swvec.1969, 40
swvec.1969:
    .quad   .L17
    .quad   .L12
    .quad   .L14
    .quad   .L15
    .quad   .L16
    .align 32
    .type   type, @object
    .size   type, 256
type:
    .zero   10
    .byte   2
    .zero   27
    .byte   4
    .zero   26
    .byte   1
    .byte   1
    .byte   1
    .byte   1
    .byte   1
    .byte   1
    .byte   1
    .byte   1
    .byte   1
    .byte   1
    .byte   1
    .byte   1
    .byte   1
    .byte   1
    .byte   1
    .byte   1
    .byte   1
    .byte   1
    .byte   1
    .byte   1
    .byte   1
    .byte   1
    .byte   1
    .byte   1
    .byte   1
    .byte   1
    .zero   3
    .byte   3
    .zero   2
    .byte   1
    .byte   1
    .byte   1
    .byte   1
    .byte   1
    .byte   1
    .byte   1
    .byte   1
    .byte   1
    .byte   1
    .byte   1
    .byte   1
    .byte   1
    .byte   1
    .byte   1
    .byte   1
    .byte   1
    .byte   1
    .byte   1
    .byte   1
    .byte   1
    .byte   1
    .byte   1
    .byte   1
    .byte   1
    .byte   1
    .zero   133
    .comm   state,4,4
    .ident  "GCC: (GNU) 8.3.1 20190223 (Red Hat 8.3.1-2)"
    .section    .note.GNU-stack,"",@progbits

With "with a hundred or so entries.", perhaps use the index into a function array instead of a `switch`. Of course it depends on what the original `case`s were doing. — chux - Reinstate Monica, Feb 26 '23 at 04:48
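The function-array idea from the comment above could look roughly like this in standard C. It is a sketch under assumptions: the handler names are hypothetical, the `state` updates mirror the `DOIT` placeholder from the answer, and the class table is abbreviated to a handful of characters to keep the example short.

```c
#include <limits.h>

enum { N_CLASSES = 5 };

/* Abbreviated class table: only a few characters are listed here;
   a full table would list every letter as in the answer. */
static const unsigned char type[UCHAR_MAX + 1u] = {
    ['a'] = 1, ['b'] = 1, ['c'] = 1,
    ['\n'] = 2, ['^'] = 3, ['&'] = 4,
    /* Unlisted entries are 0. */
};

static int state;   /* stand-in for whatever the handlers update */

static void do_other(void)   { state = 256 + 0; }
static void do_letter(void)  { state = 256 + 1; }
static void do_newline(void) { state = 256 + 2; }
static void do_caret(void)   { state = 256 + 3; }
static void do_amp(void)     { state = 256 + 4; }

typedef void (*handler_fn)(void);

/* One handler per class, indexed by the class number from type[]. */
static const handler_fn handlers[N_CLASSES] = {
    do_other, do_letter, do_newline, do_caret, do_amp,
};

/* Look up the class, then call its handler through the table. */
int dispatch(unsigned char token)
{
    handlers[type[token]]();
    return state;
}
```

Unlike the computed goto, this needs no compiler extension, at the cost of a function call per token and of handlers that cannot share local variables.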
