Bash counting letters in string, output always a little different

Question

I have a little problem with my script. My program receives strings from the user and concatenates them into one large string in a loop that only ends when the user types an asterisk (*) somewhere in the input. Later on, the script counts letters, numbers and non-alphanumeric characters separately, using a combination of grep [0-9] | wc. However, the output always gets a little crazy. Here are a couple of example strings:

.* = 0 numbers 7 letters 0 special
a1 = 2 number 2 letters = 0 special
abc123* = 4 numbers 4 letters 0 special
abc123...* = 4 numbers 4 letters 4 special
.....***** = 0 numbers = letters 6 special

In other words, it seems to add one to each count (I assumed it might be related to the use of the asterisk, but I couldn't work out how to deal with it), and when I type only asterisks, the output makes no sense at all.

echo $completestring | grep -o "[0-9]*" | wc -c
echo $completestring | grep -o "[a-zA-Z]*" | wc -c
echo $completestring | grep -o "[,._+:@%/-]*" | wc -c
$completestring contains a string written by the user

The extra count comes from the newline being counted, because you're using `-c` for `wc`. Also, `[0-9]*` matches the empty string `""`. — Ahmed Masud, Dec 07 '16 at 06:38

Ruslan Osmanov · Answer 1 · 2016-12-07T06:45:29.097

2

Asterisk

The asterisk (*) matches the preceding character or a group zero or more times. Thus

[0-9]* matches any line, because zero digits (the empty string) also counts as a match;
[a-zA-Z]* matches any line for the same reason: zero letters is still a match.

If you want to match a prefix plus zero or more arbitrary characters, use the .* expression, e.g.:

[0-9].*;
[a-zA-Z].*.

The dot (.) matches a single character.

Some tests:

$ echo 'test' | grep '[0-9].*'; echo $?
1
$ echo 'test' | grep '[0-9]*'; echo $?
test
0

The exit status ($?) is 0, if a line is selected, 1 if no lines were selected.
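
To make the difference concrete, here is a minimal sketch that reports whether a pattern selects a line. The helper name `selects` is made up for this illustration and is not part of grep; the pattern `[0-9][0-9]*` is one way to require at least one digit.

```shell
# Hypothetical helper: print "yes" if grep selects the line, "no" otherwise.
selects() {
    # $1 = text to test, $2 = basic regular expression
    if printf '%s\n' "$1" | grep -q -- "$2"; then
        echo yes
    else
        echo no
    fi
}

selects 'test' '[0-9]*'        # yes: zero digits still counts as a match
selects 'test' '[0-9][0-9]*'   # no: at least one digit is required
selects 'te5t' '[0-9][0-9]*'   # yes
```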

Quoting

Also note, you should enclose the shell variables in double quotes, if you want to prevent reinterpretation of the special characters: "$myvar".

Counting the number of pattern matches

Grep's -o option prints only the matched non-empty parts of a matching line, with each such part on a separate line. Thus the number of matching parts equals the number of lines in the output, so you need wc -l instead:

$ echo 'abc123' | grep -o '[a-z]' | wc -l 
3

$ echo 'abc123def' | grep -o '[a-z]\+' 
abc
def
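
Applied to the three character classes from the question, this could look roughly like the sketch below. The function name `count_matches` is hypothetical, and adding `*` to the special list is an assumption about what should count as "special".

```shell
# count_matches is a hypothetical helper name, not part of grep or wc.
count_matches() {
    # $1 = string to inspect, $2 = bracket expression matching ONE character
    local n
    n=$(printf '%s\n' "$1" | grep -o -- "$2" | wc -l)
    echo "$((n))"   # arithmetic expansion drops any padding some wc builds add
}

s='abc123...*'
echo "digits:  $(count_matches "$s" '[0-9]')"
echo "letters: $(count_matches "$s" '[a-zA-Z]')"
echo "special: $(count_matches "$s" '[,._+:@%/*-]')"
```

Because the pattern matches exactly one character at a time, each character lands on its own output line and wc -l gives the per-character count.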

edited Dec 07 '16 at 06:45

answered Dec 07 '16 at 06:09

Ruslan Osmanov


I don't think your reply answers the OP question – Ahmed Masud Dec 07 '16 at 06:26
@AhmedMasud, why? – Ruslan Osmanov Dec 07 '16 at 06:27
I might have got it wrong, but your solution shows whether or not grep finds something in a string. I am trying to count everything, for example: if I have the string "aabbcc123123....*" then I'd expect my code to say: 6 letters, 6 numbers, 5 special (including the asterisk) – jakubek278 Dec 07 '16 at 06:38
@jakubek278, your question is "why you are getting "crazy" results", right? So I have described why. – Ruslan Osmanov Dec 07 '16 at 06:40
You are right, I haven't even realized that I haven't stated my question correctly, sorry. my bad. – jakubek278 Dec 07 '16 at 06:54
@sorontar, that's why I have mentioned quoting in the answer – Ruslan Osmanov Dec 07 '16 at 08:01
Sorry, comment in wrong answer. In any case, maybe `${a//[^0-9]}` could be used. – Dec 07 '16 at 08:06

Ahmed Masud · Answer 2 · 2016-12-07T06:43:11.697

2

If you want to count the number of instances of a particular type of character, you can do the following:

echo $completestring | grep -o "[0-9]" | wc -l
echo $completestring | grep -o "[a-zA-Z]" | wc -l
echo $completestring | grep -o "[,._+:@%/-]" | wc -l

This will for example give you the following output for the given complete string:

completestring="foo@a321abcdr%20:/mango/25b"

echo $completestring | grep -o "[0-9]" | wc -l
7

grep matches: 3 2 1 2 0 2 5

echo $completestring | grep -o "[a-zA-Z]" | wc -l
15

grep matches: f o o a a b c d r m a n g o b

echo $completestring | grep -o "[,._+:@%/-]" | wc -l
5

grep matches: @ % : / /

If you want to count clusters of numbers and words as a single instance (e.g. mango should be 1 not 5 and 321 should be counted as 1 number not 3) then you can use something like:

echo $completestring | grep -o "[0-9][0-9]*" | wc -l
echo $completestring | grep -o "[a-zA-Z][a-zA-Z]*" | wc -l

I think the special character count is on a per character basis.
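
As a rough check of the cluster variant, the sketch below runs it on the same example string. The variable names `digit_runs` and `letter_runs` are made up for this illustration.

```shell
completestring='foo@a321abcdr%20:/mango/25b'

# Runs of digits in this string: 321, 20, 25
digit_runs=$(printf '%s\n' "$completestring" | grep -o '[0-9][0-9]*' | wc -l)
# Runs of letters in this string: foo, a, abcdr, mango, b
letter_runs=$(printf '%s\n' "$completestring" | grep -o '[a-zA-Z][a-zA-Z]*' | wc -l)

# $(( )) normalizes any whitespace padding in the wc output.
echo "digit runs: $((digit_runs))  letter runs: $((letter_runs))"
```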

edited Dec 07 '16 at 06:43

answered Dec 07 '16 at 06:36

Ahmed Masud


Thanks for your help. For some reason I thought wc -l was only a way to count how many lines are in a file. I also didn't understand the asterisk in grep, but I was scared to delete it because without it wc -c was giving even weirder results. – jakubek278 Dec 07 '16 at 06:56
Doing echo .* (unquoted, maybe you are using zsh) may also be a source of problems. Using bash internal expansion `${a//[^a-z]}` may be faster (and simpler). – Dec 07 '16 at 08:04
The `*` in a regular expression has nothing to do with an `*` in the *haystack* you're searching. So your input string *can* have an `*` and you can actually `grep` for it. e.g. `echo $completestring | grep '\*'`. The single quotes are "shell" specific, and the backslash (\\) is passed on to grep, so that it will skip over the _special meaning_ of a `*` – Ahmed Masud Dec 08 '16 at 14:34

score 1 · Answer 3 · 2016-12-07T08:25:34.540

There are several issues with your idea.

First: please, please, by all means, quote your variable expansions.

This is what happens here in some directory:
```
$ completestring=.*    ;   echo $completestring
. .. .directory .#screenon
```
Instead, I believe you want:

$ completestring=.* ; echo "$completestring"
.*

Using wc -c will count bytes, not characters (which are close to Unicode code points). Example (in a console set for UTF-8, as almost all are nowadays):
```
$ echo "école" | wc -c 
7

$ echo "ß" | wc -c
3
```
Also, wc is counting the trailing new line.
```
$ echo "123" | wc -c
4
```
You need to use echo -n (non-portable, not recommended) or printf '%s'
```
$ printf '%s' "123" | wc -c
3
```
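The two points above (bytes versus characters, and the trailing newline) can be put side by side in a small sketch. The helper names `byte_len` and `char_len` are hypothetical, and the character count depends on the locale the shell runs in.
```shell
# byte_len and char_len are hypothetical helper names.
byte_len() {
    local n
    n=$(printf '%s' "$1" | wc -c)   # printf '%s' adds no trailing newline
    echo "$((n))"                   # normalize any padding in the wc output
}
char_len() {
    echo "${#1}"                    # characters as defined by the current locale
}

# In a UTF-8 locale the two numbers differ, because é takes two bytes.
echo "bytes=$(byte_len 'école')  chars=$(char_len 'école')"
```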
Using an asterisk with grep makes it print runs of characters in each line:
```
$ completestring="jkfdsnlal92845t02u74ijopzidjb jd"

$ echo "$completestring" | grep -o '[0-9]*'
92845
02
74
```
There is no simple way to count that. A simplification is to use just the range:
```
$ echo "$completestring" | grep -o '[0-9]'
9
2
8
4
5
0
2
7
4
```
And then you can count lines:
```
$ echo "$completestring" | grep -o '[0-9]' | wc -l
9
```
Note: I'll use only a as the variable name from here on.
It is easier to type, I hope you understand :).
You should avoid including the * asterisk in the string under test if that is used for the end of the input. Depending on how you are reading the variable, maybe you can use Ctrl-D to signal an EOF to the system to end reading input from the user.
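
A minimal sketch of such an input loop, assuming the lines arrive on standard input and that EOF (Ctrl-D at a terminal) ends the input. The function name `read_all` is hypothetical.

```shell
# read_all is a hypothetical helper: it concatenates every line from stdin.
read_all() {
    local line all=''
    # IFS= and -r keep leading blanks and backslashes intact.
    # The || branch keeps a final line that lacks a trailing newline.
    while IFS= read -r line || [ -n "$line" ]; do
        all+=$line
    done
    printf '%s\n' "$all"
}

# Non-interactive usage; typed input would be ended with Ctrl-D instead.
printf 'abc\n123\n.*\n' | read_all
```

With this approach an asterisk is ordinary data, so it can be counted like any other special character.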

Using full bash:

But we can do everything we need with simple bash constructs:

$ a="jkfdsnlal92845t02u74ijopzidjb jd"
$ b="${a//[^0-9]}"                        # remove all characters
                                          # that are not decimal digits

$ echo "${b}"                             # Not really needed, but this is
928450274                                 # what var b contains.

$ echo "${#b}"                            # Print the length of var b.
9

What you wrote in your code could be translated to this (the / needs to be quoted as \/ and I included the * in the special list).

completestring=abc123*
dig=${completestring//[^0-9]}; dig=${#dig}
alpha=${completestring//[^a-zA-Z]}; alpha=${#alpha}
special=${completestring//[^,._+:@%\/*-]}; special=${#special}
echo "Digits=$dig  Alpha=$alpha  Special=$special"

Will print

Digits=3  Alpha=3  Special=1
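
The same expansions can be wrapped in a function so they are easy to reuse. This is only a sketch; the name `count_classes` is hypothetical, and the example string is the one from the comments.

```shell
# count_classes is a hypothetical wrapper around the expansions above.
count_classes() {
    local s=$1 dig alpha special
    dig=${s//[^0-9]}                  # keep only decimal digits
    alpha=${s//[^a-zA-Z]}             # keep only letters
    special=${s//[^,._+:@%\/*-]}      # keep only the listed special characters
    echo "Digits=${#dig}  Alpha=${#alpha}  Special=${#special}"
}

count_classes 'aabbcc123123....*'     # Digits=6  Alpha=6  Special=5
```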

LC_COLLATE

There is a gotcha with this system, however.
It will count many UNICODE characters as well:

$ c=aßbéc123*; a=${c//[^a-zA-Z]}; echo "string=$a    count=${#a}"
string=aßbéc    count=5

I believe that this is what you need.

But if you must limit to the 128 ascii characters, change LC_ALL or more specifically LC_COLLATE to the C locale when executing the range selection:

$ (LC_COLLATE=C a=${c//[^a-zA-Z]}; echo "string=$a    count=${#a}")
string=abc    count=3

The (…) is to use a sub-shell and avoid setting LC_COLLATE in the whole shell.
However you may set it at the start of your script and it will also work.
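
If the ASCII-only count is needed in several places, one option is a function whose body is itself a sub-shell. The sketch below uses LC_ALL rather than LC_COLLATE, on the assumption that overriding every locale category is acceptable here. The name `count_ascii_letters` is hypothetical.

```shell
# count_ascii_letters is a hypothetical helper. The body is wrapped in ( ),
# so it runs in a sub-shell and the locale change does not leak out.
count_ascii_letters() (
    LC_ALL=C                    # byte-wise matching, ranges follow ASCII order
    only=${1//[^a-zA-Z]}        # drop everything except ASCII letters
    echo "${#only}"
)

count_ascii_letters 'aßbéc123*'   # counts only a, b and c
```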

This got long, sorry. But anyway: am I still missing something?

Well, yes, I hope your passwords will not include control characters (C0: ASCII from 1 to 31 and 127, and C1: 128 to 159), because counting them has several twists. That is probably outside the scope of this answer.