How do I check whether character constants conform to ASCII?

Question

A comment on an earlier version of this answer of mine alerted me to the fact that I can't assume that 'A', 'B', 'C' etc. have successive numeric values. I had sort of assumed the C or C++ language standards guarantee that this is the case.

So, how should I determine whether consecutive letter characters' values are themselves consecutive? Or rather, how can I determine whether the character constants I can express within single quotes have their ASCII codes for a numeric value?

I'm asking how to do this both in C and in C++. Obviously the C way would work in C++ also, but if there's a C++ish facility for doing this I'm interested in that as well. Also, I'm asking about the newest relevant standards (C11, C++17).

"Or rather, how can I determine whether the character constants I can express within single quotes have their ASCII codes for a numeric value?" Why would you want to do that? — , Feb 05 '17 at 20:01
[std::isdigit](http://en.cppreference.com/w/cpp/string/byte/isdigit)? Also why both C++ and C tags? For which one is it? It would also be nice if you could clarify the version of the standard that you want the answer for. — tambre, Feb 05 '17 at 20:02
I doubt anyone on a system which uses a non-ASCII compatible encoding would ever even consider using code that was not written specifically with that system in mind. I would be surprised to find that C++ compilers even exist for such systems. — Benjamin Lindley, Feb 05 '17 at 20:03
@BenjaminLindley There are C++ compilers for non-ASCII computers. https://www.ibm.com/support/knowledgecenter/SSLTBW_2.1.0/com.ibm.zos.v2r1.cbclx01/charset.htm runs on EBCDIC. — Martin Bonner supports Monica, Feb 05 '17 at 20:16
A very simple check could be `if('A' == 65 && 'Z' - 'A' == 25) { ascii = true; }`. — Weather Vane, Feb 05 '17 at 20:20
Instead of engaging in ASCIIism, write code that doesn't care what the character set is. — Pete Becker, Feb 05 '17 at 20:56
@PeteBecker: (1) It's not my code, (2) I expect to be able to obtain the distance in number-of-letters between two letters, or ditto for digits, regardless of whether we're talking about ASCII or not. — einpoklum, Feb 05 '17 at 21:12
Could use `ascii = ('A' == 65 && 'B' == 66 ... 24 more)`. Sure, it's a long line, but why not? — chux - Reinstate Monica, Feb 05 '17 at 21:54
@chux: But, surely, [there has to be another way!](https://www.youtube.com/watch?v=c8g4Ztf7hIM) — einpoklum, Feb 05 '17 at 21:59
I think your question is a bit vague... To answer it literally, have a set of tests (96 of them?) to check that `'!' == 33 && 'a' == 97 && ......`. But depending on your goal there might be a shorter heuristic. — M.M, Feb 05 '17 at 23:12
@einpoklum Distance between letters depends on your alphabet. "'A', 'B', 'C' etc" _is_ vague. — Tom Blodget, Feb 06 '17 at 17:42

πάντα ῥεῖ · Accepted Answer · 2017-02-05T20:44:30.193

6

You can use the preprocessor to check whether particular characters have the values that ASCII assigns to them:

#include <iostream>

int main() {
    // Sample check: 'A' has its ASCII code and 'A'..'Z' span exactly 26 values.
    #if ('A' == 65 && 'Z' - 'A' == 25)
    std::cout << "ASCII" << std::endl;
    #else
    std::cout << "Other charset" << std::endl;
    #endif
    return 0;
}

The drawback is that you need to know the expected values in advance.
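
One caveat: the C standard leaves it implementation-defined whether a character constant inside `#if` has the same value as the same constant in an ordinary expression, so checking in expression context is arguably safer. A minimal sketch, assuming a C11 compiler; the sampled characters and the messages are my own choice, not something the standard prescribes:

```c
#include <assert.h> /* provides the static_assert macro in C11 */

/* Evaluated in expression context, so these see the execution character set. */
static_assert('A' == 65 && 'Z' - 'A' == 25, "uppercase letters are not ASCII-coded");
static_assert('a' == 97 && 'z' - 'a' == 25, "lowercase letters are not ASCII-coded");
static_assert('0' == 48 && ' ' == 32 && '~' == 126, "digits or punctuation are not ASCII-coded");
```

Compilation stops with the given message on a target that fails any of these checks. Like the `#if` version, this only samples a few characters; it does not prove full ASCII conformance. The same `static_assert` lines should also compile as C++11 or later, where `static_assert` is a keyword.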

By the way, the digit characters '0' through '9' are guaranteed by both the C and C++ standards to have consecutive, increasing values.
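
Because of that guarantee, digit arithmetic is portable even where letter arithmetic is not. A small sketch (`digit_value` is a hypothetical helper name, not a standard function):

```c
/* Returns 0..9 for a decimal digit character, or -1 if c is not a digit.
   Portable: '0'..'9' are guaranteed to be contiguous and increasing. */
int digit_value(char c) {
    return (c >= '0' && c <= '9') ? c - '0' : -1;
}
```

For example, `digit_value('7')` yields 7 on any conforming implementation, ASCII or not.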

edited Feb 05 '17 at 20:44

answered Feb 05 '17 at 20:05

πάντα ῥεῖ


Hmm, not the most robust check I can think of :-( – einpoklum Feb 05 '17 at 20:06
@einpoklum The best you can get though. – πάντα ῥεῖ Feb 05 '17 at 20:08
No reason you can't check against every single character. You could automatically generate that code pretty easily if you don't want to type it all out. – Benjamin Lindley Feb 05 '17 at 20:10
@BenjaminLindley: You *could* check every character, but in practice, one character will be good enough. – Martin Bonner supports Monica Feb 05 '17 at 20:18
Technically, ASCII is a *7*-bit character set. Prime minicomputers used ASCII, but *set* the 8th bit, so 'A' was 193. It is up to the OP whether he wants to consider Primes as using ASCII or not (or whether he cares if his code won't run on Primes or not). – Martin Bonner supports Monica Feb 05 '17 at 20:21
@πάνταῥεῖ a more robust check could be `#if ('A' == 65 && 'Z' - 'A' == 25)` – Weather Vane Feb 05 '17 at 20:32
@WeatherVane Adopted, THX. – πάντα ῥεῖ Feb 05 '17 at 20:44
Nitpicking, I know, but that does not ensure ASCII, just that the uppercase letters correspond to ASCII. – Pete Becker Feb 05 '17 at 20:55
@PeteBecker Not even that - it only checks two of the uppercase letters – M.M Feb 05 '17 at 23:10
@MartinBonner thanks for that refresher. Shades of fun from the 1980s. It appears I've worked with three of the more confounding character codesets (EBCDIC for IBM & Siemens-then-Fujitsu BS2000, Prime, and Honeywell 36-bit words where you had either 9-bit ASCII (yes, 8 bits wasted) or 6-bit BCD (full usage)). – zarchasmpgmr Feb 06 '17 at 18:48

score 0 · Answer 2 · answered Feb 06 '17 at 01:22

... (2) I expect to be able to obtain the distance in number-of-letters between two letters ...

This comment specifying your goal makes much more sense than your actual question! Why didn't you ask about that? You can use strchr on an array of characters, and strchr doesn't care what the native character set is, meaning your code won't care what the native character set is... For example:

char alphabet[] = "AaBbCcDdEeFfGgHhIiJjKkLlMmNnOoPpQqRrSsTtUuVvWwXxYyZz";
ptrdiff_t fubar = strchr(alphabet, 'y') - strchr(alphabet, 'X');
printf("'X' and 'y' have a distance of %td and a case difference of %td\n", fubar / 2, fubar % 2);
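
The same idea can be wrapped in functions that also cope with non-letters. This is only a sketch; `letter_index` and `letter_distance` are hypothetical names, and the -1 error convention is my own assumption:

```c
#include <string.h>

/* Position of c in the 26-letter Latin alphabet (0 for 'A' or 'a'),
   or -1 if c is not one of those letters. */
int letter_index(char c) {
    static const char alphabet[] = "AaBbCcDdEeFfGgHhIiJjKkLlMmNnOoPpQqRrSsTtUuVvWwXxYyZz";
    const char *p;
    if (c == '\0') return -1; /* strchr would otherwise match the terminator */
    p = strchr(alphabet, c);
    return p ? (int)((p - alphabet) / 2) : -1;
}

/* Signed number of letters from a to b, ignoring case.
   Only meaningful when both arguments are letters. */
int letter_distance(char a, char b) {
    return letter_index(b) - letter_index(a);
}
```

With this, `letter_distance('X', 'y')` is 1 regardless of the native character set. Note the explicit check for `'\0'`: `strchr` treats the terminating null as part of the string, so without the check a null character would be reported as a letter.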

... how should I determine whether consecutive letter characters' values are themselves consecutive?

Consecutive letter characters' values are consecutive, by definition, because they're consecutive letter characters. I know this isn't what you meant, but your actual question illustrates a lack of planning and thought, and... a stupid question warrants a stupid answer.

You're much better off programming in such a way that you don't care what values they have. Nonetheless, create an array containing the characters you care about, loop through the elements and test for inconsistencies. For example:

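#include <stddef.h>

int is_consecutive(char const *alphabet) {
    size_t x = 0;
    if (!alphabet[0]) return 1; /* an empty string is trivially consecutive */
    /* Advance while a next character exists and is exactly one greater. */
    while (alphabet[x + 1] && alphabet[x] + 1 == alphabet[x + 1]) x++;
    return !alphabet[x + 1]; /* true only if we reached the last character */
}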

... how can I determine whether the character constants I can express within single quotes have their ASCII codes for a numeric value?

Again with the lack of sense, and again with the caring about values... Alternatively, build two translation tables, native_to_ascii and ascii_to_native, and work it out from there. I won't help you with this, as it's a silly exercise involving the use of magic numbers that most likely aren't necessary for your actual goal.
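
For readers who do want the translation-table route, here is a rough sketch that needs no magic numbers beyond the base code 32. All names are hypothetical, and it covers only the 95 printable ASCII characters:

```c
#include <string.h>

/* The 95 printable ASCII characters in code order, starting at 32 (space).
   Assumption: the compiler can represent all of them; '$', '@' and '`' are
   outside the basic source character set that C11 requires. */
static const char printable_ascii[] =
    " !\"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    "[\\]^_`abcdefghijklmnopqrstuvwxyz{|}~";

/* ASCII code of a printable native character, or -1 if it has none. */
int native_to_ascii(char c) {
    const char *p;
    if (c == '\0') return -1; /* strchr would otherwise match the terminator */
    p = strchr(printable_ascii, c);
    return p ? 32 + (int)(p - printable_ascii) : -1;
}

/* Native character for a printable ASCII code (32..126), or '\0' otherwise. */
char ascii_to_native(int code) {
    return (code >= 32 && code <= 126) ? printable_ascii[code - 32] : '\0';
}

/* 1 if every printable character's native value equals its ASCII code. */
int native_charset_is_ascii(void) {
    int code;
    for (code = 32; code <= 126; code++)
        if (printable_ascii[code - 32] != code) return 0;
    return 1;
}
```

`native_charset_is_ascii()` is a run-time answer to the literal question for the printable range; control characters would need separate handling.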

answered Feb 06 '17 at 01:22

autistic


What you wrote is what one would have to do when one can't assume `'Z' - 'A'` is the same value as in the actual alphabet. But - it's not an answer to my question. – einpoklum Feb 06 '17 at 08:34
Using `strchr` adds a lot of source-code complexity, generated code, and run-time work for what will be a simple subtract when using the two groups of 26 characters in the Latin alphabet on pretty much any production system. I'd regard it much more likely that characters other than the 52 Latin letters would pose trouble than that a non-ASCII system would do so, and strchr won't help with all the vagaries of Unicode. – supercat Feb 07 '17 at 22:18
@supercat In response to the beginning of your most recent comment, ["An actual implementation need not evaluate part of an expression if it can deduce that its value is not used and that no needed side effects are produced"](http://port70.net/~nsz/c/c11/n1570.html#5.1.2.3p4)... In regards to Unicode, I'd consider using `char32_t` and a rewritten `strchr` (perhaps named `strchr32`) to operate upon it, as the Unicode support in C is abysmal to begin with. Out of curiosity, in which Unicode character set are these 52 Latin characters represented using a multibyte sequence? – autistic Feb 09 '17 at 00:12
@Seb: While it may be allowable for an implementation to optimize calls to `strchr`, and while some might do so, it is hardly a universal practice. The 52 characters of the Latin alphabet don't directly pose problems in Unicode, but code might be called upon to accept e.g. a Turkish uppercase "i" or lowercase "I", which are multi-byte characters. – supercat Feb 09 '17 at 05:24
@supercat ... and in such situations, wouldn't it be trivial to adapt this to use a different type (such as `char32_t`) and a revised version of `strchr`? This question asks *solely* about the alphabet characters, does it not? While I understand your desire, for a solution that solves all problems, this isn't always possible or practical... I'm presenting one of a few options. Don't like it? Think you can do better? By all means, do it! – autistic Feb 10 '17 at 07:13
@supercat Would you prefer the more popular answer? If so, how do you adapt that code to handle the Turkish example you've given? – autistic Feb 10 '17 at 07:15
@Seb: The only sane way to handle multi-byte characters is generally to use a Unicode library, since neither ordinal comparisons nor strchr will work. On the other hand, a lot of text-processing code is designed for machine-generated pure-ASCII text, in which case either approach would be fine. My point is that the dominant scenarios are ones in which either both will work or neither will work. For situations which would require handling non-consecutive character ranges (e.g. Base64 or Base85 decoding) I'd be inclined to use an inverse translation table rather than strchr unless... – supercat Feb 10 '17 at 15:32
...storage was at an absolute premium. – supercat Feb 10 '17 at 15:32
