
I have a little problem with my script. My program receives strings from the user and concatenates them into one large string in a loop that only ends when the user types an asterisk (*) somewhere in the input. Later on, the code counts letters, numbers and non-alphanumeric characters separately, using a combination of grep [0-9] | wc. However, the output always gets a little crazy. Here are a couple of example strings:

  • .* = 0 numbers 7 letters 0 special

  • a1 = 2 numbers 2 letters 0 special

  • abc123* = 4 numbers 4 letters 0 special

  • abc123...* = 4 numbers 4 letters 4 special

  • .....***** = 0 numbers = letters 6 special

In other words, it seems to add one to each count (I assumed it might be related to the use of the asterisk, but I couldn't work around it), but when I type only an asterisk, it comes out with crazy stuff.

echo $completestring | grep -o "[0-9]*" | wc -c
echo $completestring | grep -o "[a-zA-Z]*" | wc -c
echo $completestring | grep -o "[,._+:@%/-]*" | wc -c
$completestring contains a string written by the user
Ruslan Osmanov
jakubek278

3 Answers


Asterisk

The asterisk (*) matches the preceding character or a group zero or more times. Thus

  • [0-9]* matches anything, i.e. a digit zero or more times;
  • [a-zA-Z]* matches anything, i.e. a character from the range zero or more times.

If you want to match a prefix plus zero or more characters, use .* expression, e.g.:

  • [0-9].*;
  • [a-zA-Z].*.

The dot (.) matches any single character.


Some tests:

$ echo 'test' | grep '[0-9].*'; echo $?
1
$ echo 'test' | grep '[0-9]*'; echo $?
test
0

The exit status ($?) is 0, if a line is selected, 1 if no lines were selected.
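
To make the difference concrete, here is a small sketch built on that exit status. The helper name `matches` and the sample string are made up for illustration, and `[0-9][0-9]*` is just one way to require at least one digit.

```shell
# matches PATTERN STRING: hypothetical helper, prints yes if grep selects the line.
matches() {
  if printf '%s\n' "$2" | grep -q -- "$1"; then echo yes; else echo no; fi
}

# [0-9]* may match the empty string, so even a digit-free line is selected.
star=$(matches '[0-9]*' 'test')

# One mandatory digit followed by the starred part rejects digit-free lines.
plus=$(matches '[0-9][0-9]*' 'test')

echo "star=$star plus=$plus"   # prints: star=yes plus=no
```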

Quoting

Also note, you should enclose the shell variables in double quotes, if you want to prevent reinterpretation of the special characters: "$myvar".
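
As a rough illustration of what quoting changes, the sketch below counts how many arguments a command receives. `count_args` is a hypothetical helper, and the example assumes the default IFS.

```shell
# count_args: hypothetical helper that prints how many arguments it received.
count_args() { echo "$#"; }

myvar='one   two'
unquoted=$(count_args $myvar)     # word splitting turns the value into 2 arguments
quoted=$(count_args "$myvar")     # quoting passes the value as 1 argument, spaces intact

echo "unquoted=$unquoted quoted=$quoted"   # prints: unquoted=2 quoted=1
```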

Counting the number of pattern matches

Grep's -o option prints only the matched non-empty parts of a matching line, with each such part on a separate line. Thus the count of matching parts equals the number of lines in the output, so you need wc -l instead:

$ echo 'abc123' | grep -o '[a-z]' | wc -l 
3

$ echo 'abc123def' | grep -o '[a-z]\+' 
abc
def
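
The "adds one" effect from the question can probably be explained the same way: wc -c also counts the newline that grep -o appends to each match. A minimal sketch, using a sample string chosen for illustration:

```shell
s='abc123'

# grep -o prints the single run "123" plus a newline: 4 bytes, not 3 digits.
bytes=$(printf '%s\n' "$s" | grep -o '[0-9]*' | wc -c)

# With a single-character pattern there is one digit per line,
# so the line count is the digit count.
digits=$(printf '%s\n' "$s" | grep -o '[0-9]' | wc -l)

# $(( )) strips the padding some wc implementations add.
echo "wc -c: $((bytes))  wc -l: $((digits))"   # prints: wc -c: 4  wc -l: 3
```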
Ruslan Osmanov

If you want to count the number of instances of a particular type of character, you can do the following:

echo $completestring | grep -o "[0-9]" | wc -l
echo $completestring | grep -o "[a-zA-Z]" | wc -l
echo $completestring | grep -o "[,._+:@%/-]" | wc -l

This will for example give you the following output for the given complete string:

completestring="foo@a321abcdr%20:/mango/25b"

echo $completestring | grep -o "[0-9]" | wc -l
7

grep matches: 3 2 1 2 0 2 5

echo $completestring | grep -o "[a-zA-Z]" | wc -l
15

grep matches: f o o a a b c d r m a n g o b

echo $completestring | grep -o "[,._+:@%/-]" | wc -l
5

grep matches: @ % : / /

If you want to count clusters of numbers and words as a single instance (e.g. mango should be 1 not 5 and 321 should be counted as 1 number not 3) then you can use something like:

echo $completestring | grep -o "[0-9][0-9]*" | wc -l
echo $completestring | grep -o "[a-zA-Z][a-zA-Z]*" | wc -l

I think the special character count is on a per character basis.
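
The pipelines above could be wrapped in a small function so the pattern is the only thing that changes. This is only a sketch: `count_matches` is a hypothetical name, and it uses printf and a quoted variable instead of an unquoted echo.

```shell
# count_matches PATTERN STRING: hypothetical helper.
# grep -o prints one match per line; wc -l counts those lines.
count_matches() {
  local n
  n=$(printf '%s\n' "$2" | grep -o -- "$1" | wc -l)
  echo "$((n))"   # arithmetic expansion strips any padding wc adds
}

completestring='foo@a321abcdr%20:/mango/25b'
digits=$(count_matches '[0-9]' "$completestring")
letters=$(count_matches '[a-zA-Z]' "$completestring")
special=$(count_matches '[,._+:@%/-]' "$completestring")
number_runs=$(count_matches '[0-9][0-9]*' "$completestring")   # 321, 20, 25

echo "digits=$digits letters=$letters special=$special number_runs=$number_runs"
# prints: digits=7 letters=15 special=5 number_runs=3
```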

Ahmed Masud
  • Thanks for your help. For some reason I thought wc -l was a way to count how many lines are in the file. I also didn't understood the asterisk in grep but I was scared to delete it because without it wc -c was giving even weirder results. – jakubek278 Dec 07 '16 at 06:56
  • Doing echo .* (unquoted, maybe you are using zsh) may also be a source of problems. Using bash internal expansion `${a//[^a-z]}` may be faster (and simpler). –  Dec 07 '16 at 08:04
  • The `*` in a regular expression has nothing to do with an `*` in the *haystack* you're searching. So your input string *can* have an `*` and you can actually `grep` for it. e.g. `echo $completestring | grep '\*'`. The single quotes are "shell" specific, and the backslash (\\) is passed on to grep, so that it will skip over the _special meaning_ of a `*` – Ahmed Masud Dec 08 '16 at 14:34

There are several issues with your idea.

First, please, please, by all means: quote your variable expansions.

  1. Quote. This is what happens here in some directory:

    $ completestring=.*    ;   echo $completestring
    . .. .directory .#screenon
    

    Instead, I believe you want:

    $ completestring=.* ; echo "$completestring"
    .*

  2. Using wc -c will count bytes, not characters (which are close to Unicode code points). Example (in a console set for UTF-8, as almost all are nowadays):

    $ echo "école" | wc -c 
    7
    
    $ echo "ß" | wc -c
    3
    
  3. Also, wc is counting the trailing new line.

    $ echo "123" | wc -c
    4
    

    You need to use echo -n (non-portable, not recommended) or printf '%s'

    $ printf '%s' "123" | wc -c
    3
    
  4. Using an asterisk with grep makes it print runs of characters in each line:

    $ completestring="jkfdsnlal92845t02u74ijopzidjb jd"
    
    $ echo "$completestring" | grep -o '[0-9]*'
    92845
    02
    74
    

    There is no simple way to count that. A simplification is to use just the range:

    $ echo "$completestring" | grep -o '[0-9]'
    9
    2
    8
    4
    5
    0
    2
    7
    4
    

    And then you can count lines:

    $ echo "$completestring" | grep -o '[0-9]' | wc -l
    9
    

    Note: I'll use only a as the variable name from here on.
    It is easier to type, hope you understand :).

  5. You should avoid including the * asterisk in the string under test if that is used for the end of the input. Depending on how you are reading the variable, maybe you can use Ctrl-D to signal an EOF to the system to end reading input from the user.
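
One possible shape for the input loop from point 5 is sketched below. The function name `read_until_star` is hypothetical, and the choice to drop everything from the first asterisk onward is an assumption about what the script should do; the loop also ends on EOF (Ctrl-D).

```shell
# read_until_star: hypothetical helper. Reads lines from stdin and joins them
# until a line contains '*' or EOF is reached.
read_until_star() {
  local line all=''
  while IFS= read -r line; do
    case $line in
      *'*'*)                   # the quoted '*' is a literal asterisk
        all+=${line%%'*'*}     # keep only the text before the first '*'
        break ;;
    esac
    all+=$line
  done
  printf '%s' "$all"
}

# Simulate three lines of user input with a here-document.
completestring=$(read_until_star <<'EOF'
abc
123.
x*ignored
EOF
)
echo "$completestring"   # prints: abc123.x
```

Because the terminator never reaches the variable, the later counting steps do not have to treat the asterisk specially.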

Using full bash:

But we can do all that we need with simple bash constructs:

$ a="jkfdsnlal92845t02u74ijopzidjb jd"
$ b="${a//[^0-9]}"                        # remove all characters 
                                          # that are not decimal digits

$ echo "${b}"                             # Not really needed, but this is
928450274                                 # what var b contains.

$ echo "${#b}"                            # Print the length of var b.
9

What you wrote in your code could be translated to this (the / needs to be quoted as \/ and I included the * in the special list).

completestring=abc123*
dig=${completestring//[^0-9]}; dig=${#dig}
alpha=${completestring//[^a-zA-Z]}; alpha=${#alpha}
special=${completestring//[^,._+:@%\/*-]}; special=${#special}
echo "Digits=$dig  Alpha=$alpha  Special=$special"

Will print

Digits=3  Alpha=3  Special=1
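
For reuse, the same expansions could be collected in a function. This is a sketch only: `count_classes` is a hypothetical name, and the special-character list is the one used above.

```shell
# count_classes STRING: hypothetical wrapper around the expansions above.
count_classes() {
  local s=$1 dig alpha special
  dig=${s//[^0-9]}                # keep only the digits
  alpha=${s//[^a-zA-Z]}           # keep only the letters
  special=${s//[^,._+:@%\/*-]}    # keep only the listed punctuation
  echo "Digits=${#dig}  Alpha=${#alpha}  Special=${#special}"
}

result=$(count_classes 'abc123...*')
echo "$result"   # prints: Digits=3  Alpha=3  Special=4
```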

LC_COLLATE

There is a gotcha with this system, however.
It will count many Unicode characters as well:

$ c=aßbéc123*; a=${c//[^a-zA-Z]}; echo "string=$a    count=${#a}"
string=aßbéc    count=5

I believe that this is what you need.

But if you must limit to the 128 ascii characters, change LC_ALL or more specifically LC_COLLATE to the C locale when executing the range selection:

$ (LC_COLLATE=C a=${c//[^a-zA-Z]}; echo "string=$a    count=${#a}")
string=abc    count=3

The (…) is to use a sub-shell and avoid setting LC_COLLATE in the whole shell.
However you may set it at the start of your script and it will also work.
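
If the ASCII-only count is needed in several places, the sub-shell idea could be packaged as a function. In this sketch `ascii_letters` is a hypothetical name, and it sets LC_ALL rather than LC_COLLATE on the assumption that an LC_ALL inherited from the environment would otherwise take precedence.

```shell
# ascii_letters STRING: hypothetical helper. The ( ... ) body runs in a
# sub-shell, so the locale change does not leak into the caller.
ascii_letters() (
  LC_ALL=C                        # byte-wise matching: a-z and A-Z are ASCII only
  s=$1
  printf '%s' "${s//[^a-zA-Z]}"
)

c='aßbéc123*'
a=$(ascii_letters "$c")
echo "string=$a    count=${#a}"   # prints: string=abc    count=3
```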

This got long, sorry. But anyway: am I missing something still?

Well, yes, I hope your passwords will not be including control characters (C0: ASCII from 1 to 31 and 127, and C1: 128 to 159). Because counting them has several twists. Probably outside of this answer.