
I've seen the following program, which uses a custom toupper() function.

#include <stdio.h> 
void my_toUpper(char* str, int index)
{
    *(str + index) &= ~32;
}
int main()
{
    char arr[] = "geeksquiz";
    my_toUpper(arr, 0);
    my_toUpper(arr, 5);
    printf("%s", arr);
    return 0;
}

How exactly does this function work? I can't understand the logic behind it. It would be helpful if someone could explain it simply.

Yu Hao
Destructor
  • Answers refer to ASCII. The algorithm does work with ASCII upper and lower case characters. However, is your compiler/program/user/terminal using ASCII? Is your data always going to be unaccented English letters? Part of the answer of how does it work should include its severe, unchecked limitations. – Tom Blodget Jun 28 '15 at 16:25

2 Answers


Following the ASCII table, to convert a letter from lowercase to UPPERCASE, you need to subtract 32 from the ASCII value of the lowercase letter.

For the ASCII values representing the lowercase letters, subtracting 32 is equivalent to ANDing with ~32. That is what is being done in

 *(str + index) &= ~32;

It takes the value of the element at position index in str, subtracts 32 from it (the bitwise AND with ~32 clears that particular bit), and stores the result back at the same index.

FWIW, this is a special case of "resetting" a particular bit to get the result of actually subtracting 32. This "subtraction" works here only because every ASCII lowercase letter has that bit (value 32) set. As mentioned in the comments, this is not a general way of subtracting: if the bit is already 0, clearing it leaves the value unchanged.

Regarding the operators used,

  • &= is assignment by bitwise AND
  • ~ is bitwise NOT.

Note: This custom function does not check whether the character at that position in str is actually a lowercase letter. You need to take care of that yourself.

Sourav Ghosh
  • It isn't "subtracting 32", for example, this code won't turn `64` into `32`. – Blastfurnace Jun 28 '15 at 13:42
  • @Blastfurnace yes, there is no safety check, indeed. It assumes this is called with `str` with only valid lowercase letters. – Sourav Ghosh Jun 28 '15 at 13:43
  • @ARBY: This is only _clearing a specific bit_, don't assume it's a valid way to perform subtraction with two arbitrary numbers. – Blastfurnace Jun 28 '15 at 13:48
  • @Blastfurnace: But then, how clearing the 6th bit from the right is working here? How is it assuring a subtraction of 32? – Raman Jun 28 '15 at 13:52
  • @ARBY: It works in this case because 32 is a power of two and it relies on the arrangement of upper/lower case letters in the ASCII table. Again, this isn't a clever way to subtract an arbitrary value, it's just clearing a bit. – Blastfurnace Jun 28 '15 at 13:55
  • Actually, because this function is doing bitwise AND (_not_ subtracting), it's actually safe to call on upper-case ASCII letters as well as lowercase. (But not digits or most punctuation or anything.) – Steve Summit Jun 28 '15 at 15:23

To understand this, we have to look at the ASCII representations of letters. It's easiest to do this in base 2.

A  01000001        a  01100001
B  01000010        b  01100010
C  01000011        c  01100011
D  01000100        d  01100100
   ...                ...
X  01011000        x  01111000
Y  01011001        y  01111001
Z  01011010        z  01111010

Notice that the upper-case letters all begin with 010, and the lower-case letters all begin with 011. Notice that the lower-order bits are all the same for the upper- and lower-case versions of the same letter.

So: all we need to do to convert a lower-case letter to the corresponding upper-case letter is to change the 011 to 010, or in other words, turn off the 00100000 bit.

Now, the standard way to turn off a bit is to do a bitwise AND of a mask with a 0 in the position of the bit you want to turn off, and 1's everywhere else. So the mask we want is 11011111. We could write that as 0xdf, but the programmer in this example has chosen to emphasize that it's a complementary mask to 00100000 by writing ~32. 32 in binary is 00100000.

This technique works fine, except that it will do strange things with non-letters. For example, it will turn '{' into '[' (because they have the ASCII codes 01111011 and 01011011, respectively). It will turn an asterisk '*' into a newline '\n' (00101010 into 00001010).

The other way of converting lower to upper case in ASCII is to subtract 32. That, also, will convert 'a' to 'A' (97 to 65, in decimal), but it would also convert, for example, 'A' to '!' (65 to 33). The bitwise AND technique is actually advantageous in this case because it converts 'A' to 'A' (which is what a convert-to-uppercase routine ought to do).

The bottom line is that whether you AND with ~32 or subtract 32, in a properly safe function you're going to have to also check that the character being converted is the right kind of letter to begin with.

Also, it's worth noting that this technique absolutely assumes the 7-bit ASCII character set, and will not work with accented or non-Roman letters of other character sets, such as ISO-8859 or Unicode. (EBCDIC would be another matter.)

Steve Summit