How to find all the words that begin with a|b and end with a|b. (Ex: “adverb” and “balalaika”)

Question

The following Perl program uses a regex for this purpose, but it also matches substrings inside longer words. How can I match only whole words, i.e. strings delimited by spaces, newlines, or tabs?

The test data I used is available here: http://sainikhil.me/stackoverflow/dictionaryWords.txt

use strict;
use warnings;

sub print_a_b {
    my $file = shift;

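    # Declare with "my" so the script compiles under "use strict".
    my $pattern = qr/(a|b|A|B)\S*(a|b|A|B)/;
    # Three-argument open, with an error check.
    open my $fp, '<', $file or die "Cannot open $file: $!";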

    my $cnt = 0;
    while(my $line = <$fp>) {
        if($line =~ $pattern) {
            print $line;
            $cnt = $cnt+1;
        }
    }
    print $cnt;
}

print_a_b @ARGV;

score 3 · Accepted Answer · edited May 23 '17 at 12:01

Consider anchoring the pattern with \b, the word-boundary assertion.

That way a match can only start and end at a word boundary, not in the middle of a word.

 \b(a|b|A|B)\S*(a|b|A|B)\b

Simpler, as Avinash Raj adds in the comments:

(?i)\b[ab]\S*[ab]\b

(using the case-insensitive modifier, (?i))
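
A minimal sketch of the difference, assuming a made-up sample string and a hypothetical helper named `matches` (neither comes from the question):

```perl
use strict;
use warnings;

# Unanchored character-class pattern vs. the \b-anchored one from this answer.
my $unanchored = qr/[ab]\S*[ab]/i;
my $anchored   = qr/\b[ab]\S*[ab]\b/i;

# Hypothetical helper: list every match of $pattern in $text.
sub matches {
    my ($pattern, $text) = @_;
    return $text =~ /$pattern/g;
}

# Made-up sample text, not the asker's dictionary file.
my $text = "adverb cabbage balalaika zebra";

print join(",", matches($unanchored, $text)), "\n";   # adverb,abba,balalaika,bra
print join(",", matches($anchored,   $text)), "\n";   # adverb,balalaika
```

Without the anchors, "abba" (from "cabbage") and "bra" (from "zebra") are reported; with \b only the whole words remain.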

answered Aug 14 '16 at 13:03

VonC

KISS `(?i)\b[ab]\S*[ab]\b` – Avinash Raj Aug 14 '16 at 13:06
A clearer form is `(?i)(?<!\S)[ab]\S*[ab](?!\S)`, given "How can I only get strings separated by spaces/newlines/tabs?" – Avinash Raj Aug 14 '16 at 13:10
@AvinashRaj I always try to avoid lookaheads (positive or negative) at first, but depending on the source material, they can make sense. At least there is no backtracking involved, apparently. – VonC Aug 14 '16 at 13:13
Hi @AvinashRaj: Thanks for the reply. How is it working exactly? – Sai Nikhil Aug 14 '16 at 13:14
@saint1729 test both regexes at https://regex101.com/r/pV4iU3/1 and you should see the difference. – Avinash Raj Aug 14 '16 at 13:15
@saint1729 on the (?<!\S) syntax, see http://www.regular-expressions.info/lookaround.html#lookbehind – VonC Aug 14 '16 at 13:16
`(?<!\S)` asserts that the match is not preceded by a non-space character. `(?!\S)` asserts that the match is not followed by a non-space character. – Avinash Raj Aug 14 '16 at 13:17
Okay. Thank You :) – Sai Nikhil Aug 14 '16 at 13:33
qr/^a.*a$/ is a regular expression that matches all lines that start and end with a. Is there any generic way to find all the lines that begin and end with the same character? – Sai Nikhil Aug 14 '16 at 14:58
@saint1729 - Generic = `(?m-s)^([a-z]).*\1$` – Aug 14 '16 at 20:33
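
To illustrate the difference discussed in these comments, here is a rough sketch comparing the \b form with the lookaround form. The sample string and the helper name `matches` are assumptions made for illustration only:

```perl
use strict;
use warnings;

# \b treats punctuation as a boundary; the lookarounds only accept whitespace
# (or the start/end of the string) on either side of the match.
my $boundary   = qr/\b[ab]\S*[ab]\b/i;
my $whitespace = qr/(?<!\S)[ab]\S*[ab](?!\S)/i;

# Hypothetical helper: list every match of $pattern in $text.
sub matches {
    my ($pattern, $text) = @_;
    return $text =~ /$pattern/g;
}

# Made-up sample text containing punctuation inside tokens.
my $text = "alpha a-b x.ab arab";

print join(",", matches($boundary,   $text)), "\n";   # alpha,a-b,ab,arab
print join(",", matches($whitespace, $text)), "\n";   # alpha,a-b,arab
```

The \b version picks "ab" out of "x.ab" because the dot counts as a word boundary, while the lookaround version only accepts tokens delimited by whitespace.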

Federico Piazza · Answer 2 · 2016-08-14T15:46:47.337

1

If you have multiple words in the same line then you can use word boundaries in a regex like this:

(?i)\b[ab][a-z]*[ab]\b

In Perl, the pattern can be compiled with qr//:

my $pattern = qr/\b[ab][a-z]*[ab]\b/i;

However, if you want to match only lines that consist of a single word, anchor the pattern to the start and end of the line instead:

(?i)^[ab][a-z]*[ab]$
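
A small sketch of the whole-line check, with ^ anchoring the start of the line and $ the end. The sample lines and the helper name `whole_line_matches` are hypothetical:

```perl
use strict;
use warnings;

# The line must consist of exactly one word that starts and ends with a or b.
my $whole_line = qr/^[ab][a-z]*[ab]$/i;

# Hypothetical helper: keep only the lines that are a single matching word.
sub whole_line_matches {
    my @lines = @_;
    return grep { $_ =~ $whole_line } @lines;
}

# Made-up sample lines; $ also matches just before a trailing newline,
# so no chomp is needed.
my @sample = ("adverb\n", "an adverb\n", "balalaika\n", "cab\n");

print whole_line_matches(@sample);   # prints the "adverb" and "balalaika" lines
```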

Update: for your comment about *lines that begin and end with the same character*, you can capture the first letter and require it again at the end with a backreference (this version works per word):

(?i)\b([a-z])[a-z]*\1\b

If the first and last character may be any character rather than only a letter, you can use:

(?i)\b(.)[a-z]*\1\b
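
A minimal sketch of the letters-only backreference pattern in use, assuming a made-up sample string and a hypothetical helper named `same_end_words`:

```perl
use strict;
use warnings;

# The first letter is captured by group 1 and must appear again as the last letter.
my $same_ends = qr/\b([a-z])[a-z]*\1\b/i;

# Hypothetical helper: collect whole words that start and end with the same letter.
sub same_end_words {
    my ($text) = @_;
    my @words;
    while ($text =~ /$same_ends/g) {
        push @words, $&;   # $& holds the whole match; $1 is only the first letter
    }
    return @words;
}

# Made-up sample text. Under /i the backreference also matches case-insensitively.
print join(",", same_end_words("Anna level adverb test xyz")), "\n";   # Anna,level,test
```

Note that this pattern needs at least two characters, so a one-letter word such as "a" is not matched.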

edited Aug 14 '16 at 15:46

answered Aug 14 '16 at 15:30

Federico Piazza

Is there any generic way to find all the lines that begin and end with the same character? Not just 'a' or 'b' – Sai Nikhil Aug 14 '16 at 15:44
@saint1729 use a [backreference](http://stackoverflow.com/q/9011592/995714): `\b([ab])\S*\1\b` – phuclv Aug 14 '16 at 15:46
@saint1729 I've updated the answer with the regex for your comment – Federico Piazza Aug 14 '16 at 15:47
