
Okay, so I want to keep lines containing any of several keywords.

example of list:

Name:email:username #registered
Name2:email2:username2
Name3:email3:username3 #registered #subscribed #phonever
Name4:email4:username4 #unconfirmed

What I want to do is extract lines if they contain #registered, #subscribed, or #phonever.

example of output I want,

Name:email:username #registered
Name3:email3:username3 #registered #subscribed #phonever
user3255841

3 Answers


With awk (using the regex alternation operator, |, between the fixed strings):

awk '/#registered|#subscribed|#phonever/' file

The part between /.../ is called an awk pattern; for each matching line, awk executes the action that follows (given as { ... }). But since the default action is { print $0 } (print the complete input record/line), there's no need to specify it here.
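For instance, the pattern-only form and the form with the default action spelled out print the same matching lines (a quick sketch; the input file is recreated here from the question's example):

```shell
# Recreate the sample input from the question
cat > file <<'EOF'
Name:email:username #registered
Name2:email2:username2
Name3:email3:username3 #registered #subscribed #phonever
Name4:email4:username4 #unconfirmed
EOF

# Pattern only: the default action { print $0 } is implied
awk '/#registered|#subscribed|#phonever/' file

# Pattern with the action spelled out explicitly: same output
awk '/#registered|#subscribed|#phonever/ { print $0 }' file
```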

Similarly with sed you could say:

sed -nE '/#registered|#subscribed|#phonever/p' file

but now we have to use -n to suppress the default printing, and print with the p command only those lines that match the pattern (called a sed address). The -E tells sed to use POSIX ERE (extended regex), and we need it here because the default, POSIX BRE (basic regex), does not define the alternation operator.
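An equivalent sed spelling keeps the default printing and instead deletes the non-matching lines with !d (a sketch; the sample file is recreated from the question's example):

```shell
cat > file <<'EOF'
Name:email:username #registered
Name2:email2:username2
Name3:email3:username3 #registered #subscribed #phonever
Name4:email4:username4 #unconfirmed
EOF

# !d: delete every line that does NOT match the ERE; the rest are
# printed by sed's default output, so no -n / p is needed
sed -E '/#registered|#subscribed|#phonever/!d' file
```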

For simple filtering (printing the lines that match some pattern), grep is also an option (and a very fast one at that). Note that the backslashed alternation \| in BRE is a GNU grep extension:

grep '#registered\|#subscribed\|#phonever' file
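Portable alternatives avoid the GNU-specific \| in BRE: either switch to ERE with -E, or pass one -e option per fixed string (a sketch on the question's sample file):

```shell
cat > file <<'EOF'
Name:email:username #registered
Name2:email2:username2
Name3:email3:username3 #registered #subscribed #phonever
Name4:email4:username4 #unconfirmed
EOF

# ERE alternation (POSIX)
grep -E '#registered|#subscribed|#phonever' file

# One -e option per pattern (POSIX); a line matches if any pattern matches
grep -e '#registered' -e '#subscribed' -e '#phonever' file
```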

A more general solution (awk with a patterns file)

For larger (and possibly dynamic) lists of patterns, a solution could be to keep all the patterns in a separate file, for example patterns:

#registered
#subscribed
#phonever

and to use this awk program:

awk 'NR==FNR { pat[$0]=1 } NR>FNR { for (p in pat) if ($0 ~ p) {print;next} }' patterns file

which first loads all patterns into the pat array, and then tries to match each of those patterns against every line of file, printing the line and advancing to the next one on the first match found.
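To try it end to end, recreate both inputs and run the script (a sketch using the question's sample data):

```shell
cat > patterns <<'EOF'
#registered
#subscribed
#phonever
EOF

cat > file <<'EOF'
Name:email:username #registered
Name2:email2:username2
Name3:email3:username3 #registered #subscribed #phonever
Name4:email4:username4 #unconfirmed
EOF

# First pass (NR==FNR): load each pattern line into the pat array.
# Second pass: print a line (once) as soon as any pattern matches it.
awk 'NR==FNR { pat[$0]=1 } NR>FNR { for (p in pat) if ($0 ~ p) { print; next } }' patterns file
```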

The result is the same:

Name:email:username #registered
Name3:email3:username3 #registered #subscribed #phonever

but the script now doesn't change for each new set of patterns. Note, however, that this carries a performance penalty (as general solutions usually do). For shorter lists of patterns and smaller files, this shouldn't be a problem.


A much faster variant of the above (grep with a fixed-string patterns file)

Building on the approach above (keeping a list of fixed-string "patterns" in a file), we can actually use grep, which provides a specialized option (-f FILE) for reading patterns from a file, one per line. To further speed up the matching, we should also use the -F/--fixed-strings option.

So, this:

grep -Ff patterns file

will be very fast, handling long lists of fixed-string patterns and huge files with low memory overhead.
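Putting it together (a sketch, with both files recreated from the question's example):

```shell
cat > patterns <<'EOF'
#registered
#subscribed
#phonever
EOF

cat > file <<'EOF'
Name:email:username #registered
Name2:email2:username2
Name3:email3:username3 #registered #subscribed #phonever
Name4:email4:username4 #unconfirmed
EOF

# -F: treat each pattern as a fixed string (no regex compilation)
# -f: read the patterns from a file, one per line
grep -Ff patterns file
```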

randomir
  • this works thank you :), but if the list is large it doesn't work, I have to reduce to 3 keywords rather than 10 or 20 etc, not sure if that's because i copy and pasted rather than typed – user3255841 Aug 06 '17 at 21:18
  • It should work for larger lists too. But if your list of patterns is in a file, the `awk` script can be modified to handle that, without the need to retype it. If you want I can add it to my answer. – randomir Aug 06 '17 at 21:34
  • Added a solution for dynamic lists of patterns. – randomir Aug 06 '17 at 21:52
  • Thank you Randomir pat works great :), really appreciate your help. – user3255841 Aug 06 '17 at 22:19
  • Alternatively: `awk 'NR==FNR { re=(NR>1?re "|":"") $0; next } $0~re' regexps file`. I hate the word `pattern` as it's so ambiguous - we should just use "string" or "regexp" (or "globbing pattern"), whichever we mean. – Ed Morton Aug 06 '17 at 23:56
    @EdMorton, nice, I didn't think of that. What do you think would be faster for a large list of keywords (not regexps)? – randomir Aug 07 '17 at 00:16
  • I'd think the `|`-separated regexp. It might actually be fastest to split the `#`-strings and do a hash lookup (`in`) on each line. I'll post a solution based on that and anyone who likes can try it. – Ed Morton Aug 07 '17 at 00:57
  • @randomir I added a solution at https://stackoverflow.com/a/45538112/1745001 that involves generating all possible combinations of the key words first and then just doing a hash lookup of each line of the input file to see if any of those combinations are present. Enjoy :-). – Ed Morton Aug 07 '17 at 13:26
  • @EdMorton, as much as your solution is interesting, I think, in this use case, we can't beat `grep` (see my edit) with `awk`. :) – randomir Aug 07 '17 at 14:59
  • The grep approach will fail when one of the target values appears elsewhere on the line or can appear as part of a different value e.g. if a "username" contains #phonenum or #foo is a target string and #foobar can also be present. – Ed Morton Aug 07 '17 at 15:05
  • @EdMorton, that's true; each solution has some assumptions. (Like your solution will fail if any of the fields contains a space.) – randomir Aug 07 '17 at 15:10
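Ed Morton's caveat above is easy to reproduce: with -F, a pattern that happens to be a prefix of another tag still matches as a substring (a sketch with hypothetical tags #foo and #foobar, not from the question's data):

```shell
cat > patterns <<'EOF'
#foo
EOF

cat > file <<'EOF'
Name:email:username #foobar
EOF

# #foo matches inside #foobar, so this line is printed even though
# the #foo tag itself never appears in the file
grep -Ff patterns file
```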

Simple awk approach:

awk '/#(registered|subscribed|phonever)/' file

The output:

Name:email:username #registered
Name3:email3:username3 #registered #subscribed #phonever

  • (registered|subscribed|phonever) - a regexp alternation group that matches any one of the listed alternatives after the literal #
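The factored group is equivalent to writing the alternation out in full (a quick sketch on the question's sample file):

```shell
cat > file <<'EOF'
Name:email:username #registered
Name2:email2:username2
Name3:email3:username3 #registered #subscribed #phonever
Name4:email4:username4 #unconfirmed
EOF

# Factored: one literal # followed by any of the alternatives
awk '/#(registered|subscribed|phonever)/' file

# Expanded: same set of matching lines
awk '/#registered|#subscribed|#phonever/' file
```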
RomanPerekhrest
$ cat tst.awk
NR==FNR {
    strings[$0]
    next
}
{
    for (i=2; i<=NF; i++) {
        if ($i in strings) {
            print
            next
        }
    }
}

$ awk -f tst.awk strings file
Name:email:username #registered
Name3:email3:username3 #registered #subscribed #phonever

$ cat strings
#registered
#subscribed
#phonever

$ cat file
Name:email:username #registered
Name2:email2:username2
Name3:email3:username3 #registered #subscribed #phonever
Name4:email4:username4 #unconfirmed

If your file is huge, your set of target words is relatively small, and speed of execution matters to you, then you could first generate every possible combination of every possible non-empty subset of those target words:

$ cat subsets.awk
###################
# Calculate all subsets of a given set, see
# https://en.wikipedia.org/wiki/Power_set

function get_subset(A,subsetNr,numVals, str, sep) {
    while (subsetNr) {
        if (subsetNr%2 != 0) {
            str = str sep A[numVals]
            sep = " "
        }
        numVals--
        subsetNr = int(subsetNr/2)
    }
    return str
}

function get_subsets(A,B,       i,lgth) {
    lgth = length(A)
    for (i=1;i<2^lgth;i++) {
        B[get_subset(A,i,lgth)]
    }
}

###################

# Input should be a list of strings
{
    split($0,A)
    delete B
    get_subsets(A,B)
    for (subset in B) {
        print subset
    }
}


$ cat permutations.awk
###################
# Calculate all permutations of a set of strings, see
# https://en.wikipedia.org/wiki/Heap%27s_algorithm

function get_perm(A,            i, lgth, sep, str) {
    lgth = length(A)
    for (i=1; i<=lgth; i++) {
        str = str sep A[i]
        sep = " "
    }
    return str
}

function swap(A, x, y,  tmp) {
    tmp  = A[x]
    A[x] = A[y]
    A[y] = tmp
}

function generate(n, A, B,      i) {
    if (n == 1) {
        B[get_perm(A)]
    }
    else {
        for (i=1; i <= n; i++) {
            generate(n - 1, A, B)
            if ((n%2) == 0) {
                swap(A, 1, n)
            }
            else {
                swap(A, i, n)
            }
        }
    }
}

function get_perms(A,B) {
    generate(length(A), A, B)
}

###################

# Input should be a list of strings
{
    split($0,A)
    delete B
    get_perms(A,B)
    for (perm in B) {
        print perm
    }
}


$ echo '#registered #subscribed #phonever' |
    awk -f subsets.awk |
    awk -f permutations.awk
#registered #subscribed #phonever
#subscribed #phonever #registered
#phonever #subscribed #registered
#phonever #registered #subscribed
#subscribed #registered #phonever
#registered #phonever #subscribed
#phonever
#subscribed
#registered #subscribed
#subscribed #registered
#registered
#registered #phonever
#phonever #registered
#subscribed #phonever
#phonever #subscribed

and then you could make the rest of the processing just a simple hash lookup:

$ echo '#registered #subscribed #phonever' |
    awk -f subsets.awk |
    awk -f permutations.awk |
    awk 'NR==FNR{strings[$0];next} {k=(NF>1?$0:"");sub(/[^ ]+ /,"",k)} k in strings' - file
Name:email:username #registered
Name3:email3:username3 #registered #subscribed #phonever
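As a sanity check on the size of the generated set: the number of ordered, non-empty subsets of n strings is the sum over k = 1..n of n!/(n-k)!, which gives the 15 combinations shown above for n = 3 and grows factorially. A quick awk sketch (n is a parameter to vary):

```shell
# Count the ordered non-empty subsets of n items:
# sum over k = 1..n of n!/(n-k)!  (the number of k-permutations)
awk -v n=3 'BEGIN {
    total = 0
    for (k = 1; k <= n; k++) {
        p = 1
        for (i = n; i > n - k; i--) p *= i   # n!/(n-k)!
        total += p
    }
    print total
}'
```

For n = 3 this prints 15; for n = 10 it is 9864100, matching the count mentioned in the comments below.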
Ed Morton
  • It's a cute prototype, but totally impractical for more than 5-6 strings/patterns. With a factorial explosion around 8-9 strings -- for 10 strings it generates `~10M` combinations with `>600MB` of memory overhead. And that's only the generation part. :) – randomir Aug 07 '17 at 14:58
  • For 10 strings it took 2 mins to generate the set of 9,864,100 combinations. Might still be faster than looping through the strings for every line of the input file depending on the size of the input file. As I mentioned it's for the case where the input file is large and set of strings small. – Ed Morton Aug 07 '17 at 15:11
  • You did mention it, I just wanted to emphasize the magnitude of "small" we're talking about. – randomir Aug 07 '17 at 15:12