
I would like to count the lines in a document that contain a search string, allowing for one mismatched character.

If input is:

GGTGGTGGTAT
GGTAGTGGTAT
GGTGGTGGTAT
GGTAATGGTAT

If I search for GGTGGTGGT, allowing for one ambiguous (mismatched) character, I would like to find 3 matches.

Using egrep it would look something like this and have an output of 3.

 egrep -c "GGTGGTGGT|.GTGGTGGT|G.TGGTGGT|GG.GGTGGT|GGT.GTGGT|GGTG.TGGT|GGTGG.GGT|GGTGGT.GT|GGTGGTG.T|GGTGGTGG." input
Stuber

5 Answers


Here's a way to generate that regex with bash:

$ patt=(GGTGGTGGT)
$ for ((i=0; i<${#patt[0]}; i++)); do 
    patt+=( "${patt[0]:0:i}.${patt[0]:i+1}" )
  done
$ regex=$(IFS='|'; echo "${patt[*]}")
$ echo "$regex"
GGTGGTGGT|.GTGGTGGT|G.TGGTGGT|GG.GGTGGT|GGT.GTGGT|GGTG.TGGT|GGTGG.GGT|GGTGGT.GT|GGTGGTG.T|GGTGGTGG.

and then:

awk -v regex="$regex" '$0 ~ regex' file
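Since the question asks for a count rather than the matching lines, the generated regex can also be handed to grep with -c. The following is a minimal sketch, not part of the original answer: build_regex is a hypothetical helper name, the regex is built with awk only so the block runs under any POSIX shell, and the sample file is a temporary file holding the four lines from the question.

```shell
# Build the alternation "GGTGGTGGT|.GTGGTGGT|..." for any search string.
# build_regex is a hypothetical helper name used for illustration.
build_regex() {
    awk -v srch="$1" 'BEGIN {
        regex = srch
        for (i = 1; i <= length(srch); i++)
            regex = regex "|" substr(srch, 1, i - 1) "." substr(srch, i + 1)
        print regex
    }'
}

# Sample data from the question, written to a temporary file.
sample=$(mktemp)
printf '%s\n' GGTGGTGGTAT GGTAGTGGTAT GGTGGTGGTAT GGTAATGGTAT > "$sample"

# grep -E is equivalent to egrep; -c prints the number of matching lines.
count=$(grep -c -E "$(build_regex GGTGGTGGT)" "$sample")
echo "$count"    # prints 3
rm -f "$sample"
```

Because -c counts matching lines, a line that contains the motif twice still counts once.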

Or with awk only:

awk -v srch=GGTGGTGGT '
    BEGIN {
        regex = srch
        for (i=1; i<=length(srch); i++) 
            regex = regex "|" substr(srch,1,i-1) "." substr(srch, i+1)
    }
    $0 ~ regex
' << END
GGTGGTGGTAT
GGTAGTGGTAT
GGTGGTGGTAT
GGTAATGGTAT
END
GGTGGTGGTAT
GGTAGTGGTAT
GGTGGTGGTAT
glenn jackman
  • The other answers also worked, but I got the desired result much faster by building $regex in bash and then running `egrep -c "$regex" file`. – Stuber Feb 03 '15 at 14:21

This awk executable script will create the patterns to match on, then test each line to count matches:

#!/usr/bin/awk -f

BEGIN { createPatternArray( pattern, a ) }

{
    for( k in a ) { if( $0 ~ k ) { total++; break } }
}

END { print total + 0 }   # + 0 prints 0 instead of an empty line when nothing matches

function createPatternArray( pattern, a,       pLen, i ) {
    a[pattern]
    pLen = length( pattern )
    for(i=1; i<=pLen; i++) {
        a[substr(pattern,1,i-1) "." substr(pattern,i+1)]
    }
    # for( k in a ) { print k }
}

If it is saved in a file such as awko, made executable, and placed on the PATH, running it on the data looks like:

awko -v pattern=GGTGGTGGT data
3

The createPatternArray function makes entries in the array like:

.GTGGTGGT
G.TGGTGGT
GG.GGTGGT
GGT.GTGGT
GGTG.TGGT
GGTGG.GGT
GGTGGT.GT
GGTGGTG.T
GGTGGTGG.
GGTGGTGGT

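Each line is tested against the entries in the array. The match is unanchored, so a pattern may occur anywhere in the line, not only at its start. If there's a match, total is incremented and the loop breaks (otherwise a line matching several patterns would be counted more than once). At the END, the total is printed.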
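Note that `$0 ~ k` matches a pattern anywhere in the line. If only matches at the start of each line are wanted (an assumption, since the question does not say), the patterns could be anchored with ^. A small sketch of both behaviors follows; count_matches and its anchor argument are hypothetical names, and the second input line is made up to show the difference.

```shell
# count_matches SEARCH ANCHOR  (reads lines from stdin)
# ANCHOR is "^" to match only at the start of a line, or "" to match anywhere.
count_matches() {
    awk -v pattern="$1" -v anchor="$2" '
        BEGIN {
            # One alternation: the exact string plus every one-wildcard variant.
            regex = anchor "(" pattern
            for (i = 1; i <= length(pattern); i++)
                regex = regex "|" substr(pattern, 1, i - 1) "." substr(pattern, i + 1)
            regex = regex ")"
        }
        $0 ~ regex { total++ }
        END { print total + 0 }
    '
}

# The second line carries the motif after a two-character prefix.
printf '%s\n' GGTGGTGGTAT ATGGTGGTGGT | count_matches GGTGGTGGT ''    # prints 2
printf '%s\n' GGTGGTGGTAT ATGGTGGTGGT | count_matches GGTGGTGGT '^'   # prints 1
```

Folding everything into a single regex also avoids the inner loop over the array for every input line.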

n0741337
  • You don't have to include `a,pLen, i` in the argument list as they aren't passed to the function. Actually as all variables are global in awk you don't even need an argument list. – JID Feb 03 '15 at 08:43
  • @JID - The variables are scoped locally to the function by naming them in the args list. In the function, `pattern` is a copy of the global `pattern` and the array `a` is passed by reference. Because extra parameters aren't passed for `pLen` and `i`, they are merely local, null initialized variables. It would certainly work to not pass args and use them as globals but I chose not to here. – n0741337 Feb 03 '15 at 20:27

Here is a way using (G)awk and the gensub function

awk -va="GGTGGTGGT" '
        {for(i=1;i<=length(a);i++)if($0~gensub(/./,".",i,a)){print;next}}' file

Output

GGTGGTGGTAT
GGTAGTGGTAT
GGTGGTGGTAT

How it works

-va="GGTGGTGGT"

Sets the variable a to the value enclosed in the quotes (this can be any search string).

{for(i=1;i<=length(a);i++)

Creates a loop from 1 to the length of the variable a. The length is the number of characters in the string.

if($0~gensub(/./,".",i,a))

I'll explain the gensub first.
The first two arguments replace a match of the regex . (any character) with a literal "." character. The third argument selects which occurrence of the match to replace; because the regex matches a single character, increasing i replaces each character of the string in turn. The final argument is the string to operate on, here a. gensub returns the modified string and leaves the original unchanged. When the result is then used as a regex, the inserted . acts as a wildcard for that position.

$0~ 

Tests whether the line contains a match for the regex that follows the ~.

Both parts sit inside the if, so across the loop iterations the tests evaluate to:

$0~.GTGGTGGT
$0~G.TGGTGGT
$0~GG.GGTGGT
$0~GGT.GTGGT
$0~GGTG.TGGT
$0~GGTGG.GGT
$0~GGTGGT.GT
$0~GGTGGTG.T
$0~GGTGGTGG.

{print;next}

If any of those match, the line is printed, all further instructions are skipped, and the next line is processed.
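gensub is a gawk extension, and the one-liner regenerates every pattern for every input line. As a hedged alternative for awks without gensub, the same patterns can be built once in BEGIN with substr. This is a sketch of that variant rather than the answer's own code; the array name pat is an arbitrary choice.

```shell
matches=$(
    printf '%s\n' GGTGGTGGTAT GGTAGTGGTAT GGTGGTGGTAT GGTAATGGTAT |
    awk -v a=GGTGGTGGT '
        BEGIN {
            n = length(a)
            # pat[i] is the search string with character i replaced by "."
            for (i = 1; i <= n; i++)
                pat[i] = substr(a, 1, i - 1) "." substr(a, i + 1)
        }
        {
            # Print the line on the first pattern that matches, then move on.
            for (i = 1; i <= n; i++)
                if ($0 ~ pat[i]) { print; next }
        }
    '
)
printf '%s\n' "$matches"
# GGTGGTGGTAT
# GGTAGTGGTAT
# GGTGGTGGTAT
```

The exact string needs no pattern of its own, because "." also matches the character it replaced.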


Resources

https://www.gnu.org/software/gawk/manual/html_node/String-Functions.html

Community

The necessary awk pattern is the same as your egrep solution:

awk '/GGTGGTGGT|.GTGGTGGT|G.TGGTGGT|GG.GGTGGT|GGT.GTGGT|GGTG.TGGT|GGTGG.GGT|GGTGGT.GT|GGTGGTG.T|GGTGGTGG./{print $0}' input
Steve Vinoski
  • I'm looking for a solution where I input just the search string "GGTGGTGGT" and awk handles the rest, possibly by building the expanded search pattern (including the "." variants) and then performing the search. – Stuber Feb 03 '15 at 00:56
  • You might consider clarifying your question in that case. – Steve Vinoski Feb 03 '15 at 01:01

What you really want is agrep, which stands for approximate grep. It works incredibly well and is sometimes even faster than regular grep.

You can find the original here.
Installing is as simple as downloading the tarball, running tar -xf <file>, navigating into the resulting folder, and running make.

Or the current (and possibly more bloated) version here

In your case you would simply run:

agrep -1 GGTGGTGGT <file>

The -# is the number of mismatches you would like to allow. The original version supports up to 8 mismatches.

It is important to note that agrep considers a 'mismatch' to be an insertion, a deletion, or a substitution. So matches with one fewer or one extra character than the pattern string are also accepted, while all of the other answers here require the match to have the same number of characters as the pattern.
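To illustrate the difference without running agrep itself, the sketch below applies the substitution-only regex used by the other answers to a line in which one character of the motif has been deleted. The test line is made up for this example. The regex finds nothing, whereas, per the description above, agrep -1 would be expected to accept such a line.

```shell
# Substitution-only regex: the exact string plus every one-wildcard variant.
regex=$(awk -v srch=GGTGGTGGT 'BEGIN {
    regex = srch
    for (i = 1; i <= length(srch); i++)
        regex = regex "|" substr(srch, 1, i - 1) "." substr(srch, i + 1)
    print regex
}')

# GGTGTGGTAT is GGTGGTGGTAT with the fifth character (a G) deleted.
# grep exits non-zero when nothing matches, hence the "|| true".
del_count=$(printf '%s\n' GGTGTGGTAT | grep -c -E "$regex" || true)
echo "$del_count"    # prints 0
```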

Cole