Why doesn't this regex catch the periods correctly?

Question

I'm fiddling around trying to learn more about shell scripting. So, I have some files with email in them that spamassassin writes to a directory, and I thought I would try to do some regex matching on them. So, I select files that require different matches and then try to sort through them.

I wrote this script:

#!/usr/local/bin/bash
#
regex='(\.)?'
files="/var/spool/spam/testing/out.*"
for i in $files; do
    domain=`cat $i | grep -i "Message-ID: <" | cut -d'@' -f2 | cut -d'>' -f1 | cut -d' ' -f1`
    echo "Domain is $domain"
    echo "We're starting the if loop"
    if [ -z "$domain" ];
    then
        echo "Domain is empty"
        echo $i
        #rm $i
    elif ! [[ "$domain" =~ $regex ]];
    then
        echo "There are no periods in the domainname $domain"
    elif [[ $domain =~ $regex ]];
    then
        echo "There are periods in the domainname $domain"
    fi
done

What I'm trying to accomplish is to separate out the domain part of the Message-ID: header and then determine what that domain is. Some Message-IDs have no domain at all. Some have bogus domains. Some have domains like this: yahoo.co.uk.

Every message has two Message-ID: entries, so the domain names end up appearing twice.

When I run this script on two files, this is the result I get:

# bash /usr/local/bin/rm-bounces.sh 
Domain is xbfoqrka
xbfoqrka
We're starting the if loop
There are periods in the domainname xbfoqrka
xbfoqrka
Domain is SKY-20150201SFT.com
SKY-20150201SFT.com
We're starting the if loop
There are periods in the domainname SKY-20150201SFT.com
SKY-20150201SFT.com

What I don't understand is why xbfoqrka matches the regex that is supposed to find periods in the domain name but doesn't match the test that looks for NO periods in the domain name. I'm escaping the period, so it should be an exact match, and there's no period in xbfoqrka.

score 1 · Answer 1 · answered Aug 24 '15 at 03:00

The ? quantifier means zero or one. So the regexp `(\.)?` is looking for an optional `.` in the text, and because it is not anchored, it is allowed to match nothing at all. Since there's no `.` in xbfoqrka, the regex still finds a match (for zero dots).

Note that the regex will return true for any number of `.`: zero, one, three, 100, etc. That's because a string with 100 dots still contains a place where zero or one dots can match.
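
To see the difference side by side, here is a minimal sketch, assuming bash with `[[ =~ ]]` support. The variable and function names (`optional_dot`, `literal_dot`, `has_dot`) are made up for illustration; only the two patterns and the sample domains come from this thread.

```shell
#!/usr/bin/env bash
# Hypothetical names; the patterns are the one from the question and the one
# suggested in the comments.

optional_dot='(\.)?'   # dot is optional, so a zero-length match always succeeds
literal_dot='\.'       # requires at least one literal dot somewhere in the string

# Succeeds (exit status 0) only if the argument contains a period.
has_dot() {
    [[ $1 =~ $literal_dot ]]
}

for domain in xbfoqrka SKY-20150201SFT.com yahoo.co.uk; do
    if [[ $domain =~ $optional_dot ]]; then
        echo "optional pattern matches: $domain"
    fi
    if has_dot "$domain"; then
        echo "has a period: $domain"
    else
        echo "no period: $domain"
    fi
done
```

The optional pattern should report a match for every domain, including xbfoqrka, while `has_dot` only succeeds for the two names that actually contain a period.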

slebetman


So I should use no modifier at all? – Paul Schmehl Aug 24 '15 at 03:13
@PaulSchmehl: If the purpose is merely to detect the presence of `.`, then the correct regexp is `'\.'`. The `()` are also useless in this case (though they're mostly harmless) – slebetman Aug 24 '15 at 03:23
How can you do one, and only one. Two and only two? – Paul Schmehl Aug 24 '15 at 03:23
One and only one is: `^[^.]*\.[^.]*$`. Basically, a string that starts with zero or more characters that are not dots, followed by one dot, followed by zero or more characters that are not dots (change each `*` to `+` if you want to eliminate strings that start or end with a dot) – slebetman Aug 24 '15 at 03:26
Two and only two can be done as a variation of the above: `^([^.]*\.){2}[^.]*$`. That is, a string that starts with two sequences of zero or more non-dots, each followed by one dot, then zero or more non-dots – slebetman Aug 24 '15 at 03:28
Thanks. Regex in shell scripting is a lot more difficult than in perl, I have found. – Paul Schmehl Aug 24 '15 at 03:37
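
For anyone who wants to try the two counting patterns from the comments above, here is a rough sketch, again assuming bash. The function name `count_class` and the test string `a.b.c.d` are invented for this example; the patterns are copied from the comments unchanged.

```shell
#!/usr/bin/env bash
# Patterns taken from the comments; everything else is illustrative.

exactly_one='^[^.]*\.[^.]*$'        # non-dots, one dot, non-dots
exactly_two='^([^.]*\.){2}[^.]*$'   # two "non-dots then dot" groups, then non-dots

# Prints "one", "two", or "other" depending on how many periods the argument has.
count_class() {
    if [[ $1 =~ $exactly_one ]]; then
        echo one
    elif [[ $1 =~ $exactly_two ]]; then
        echo two
    else
        echo other
    fi
}

for domain in xbfoqrka SKY-20150201SFT.com yahoo.co.uk a.b.c.d; do
    echo "$domain: $(count_class "$domain")"
done
```

Keeping each pattern in a variable and leaving it unquoted on the right-hand side of `=~` matters here: quoting the right-hand side makes bash treat it as a literal string instead of a regex.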
