How do I filter tab-separated input by the count of fields with a given value?

Question

My data (tab-separated):

1   0   0   1   0   1   1   0   1
1   1   0   1   0   1   0   1   1
1   1   1   1   1   1   1   1   1
0   0   0   0   0   0   0   0   0
...

How can I grep the lines with exactly, for example, 5 '1's? Ideal output:

1   0   0   1   0   1   1   0   1

Also, how can I grep the lines with 5 or more (>=) '1's? Ideal output:

1   0   0   1   0   1   1   0   1
1   1   0   1   0   1   0   1   1
1   1   1   1   1   1   1   1   1

I tried:

grep 1$'\t'1$'\t'1$'\t'1$'\t'1

However, this only outputs lines with consecutive '1's, which is not all I want.

I wonder if there is a simple method to achieve this. Thank you!

Regular expressions are meant for matching patterns, not for counting. — Ken White, May 26 '16 at 03:10
`awk '{for(i=1;i<=NF;i++) {if(i==5{sum+=$i}}}if(sum==5)print $0}' file` would be untested, but close to a solution. Good luck. — shellter, May 26 '16 at 03:28
^^ `(i==5)` is wrong here. Plus one missing parenthesis, but I guess that one was typo. — anishsane, May 26 '16 at 03:34
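
For reference, a corrected version of the loop sketched in the comment above might look like the following. It is only a minimal sketch: it assumes the fields are tab-separated and that a field counts only when it is exactly the string 1. The function name count_ones_filter and the sample data are made up for illustration.

```shell
# Hypothetical helper: print lines from stdin that have exactly $1 fields equal to "1".
count_ones_filter() {
  awk -F'\t' -v want="$1" '
    {
      n = 0
      for (i = 1; i <= NF; i++)   # visit every field on the line
        if ($i == "1") n++        # count fields that are exactly 1
    }
    n == want                     # pattern without action: print the line
  '
}

# Illustrative tab-separated input: the first line has five 1 fields, the second has six.
# Only the first line is printed.
printf '1\t0\t0\t1\t0\t1\t1\t0\t1\n1\t1\t0\t1\t0\t1\t0\t1\t1\n' | count_ones_filter 5
```

Changing `n == want` to `n >= want` would select lines with at least that many 1 fields.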

score 4 · Accepted Answer · edited May 23 '17 at 11:52

John Bollinger's helpful answer and anishsane's answer show that it can be done with grep, but, as has been noted, that is quite cumbersome, given that regular expressions aren't designed for counting.

awk, by contrast, is built for field-based parsing and counting (often combined with regular expressions to identify field separators, or, as below, the fields themselves).

Assuming you have GNU awk, you can use the following:

Exactly 5 1s:

awk -v FPAT='\\<1\\>' 'NF==5' file

5 or more 1s:

awk -v FPAT='\\<1\\>' 'NF>=5' file

Special variable FPAT is a GNU awk extension that allows you to identify fields via a regex that describes the fields themselves, in contrast with the standard approach of using a regex to define the separators between fields (via special variable FS or option -F):
- '\\<1\\>' identifies any "isolated" 1 (surrounded by non-word characters) as a field, based on word-boundary assertions \< and \>; the \ must be doubled here so that the initial string parsing performed by awk doesn't "eat" single \s.
Standard variable NF contains the count of input fields in the line at hand, which allows easy numerical comparison. If the conditional evaluates to true, the input line at hand is implicitly printed (in other words: NF==5 is implicitly the same as NF==5 { print } and, more verbosely, NF==5 { print $0 }).

A POSIX-compliant awk solution is a little more complicated:

Exactly 5 1s:

awk '{ l=$0; gsub("[\t0]", "") }; length($0)==5 { print l }' file

5 or more 1s:

awk '{ l=$0; gsub("[\t0]", "") }; length($0)>=5 { print l }' file

l=$0 saves the input line ($0) in its original form in variable l.
gsub("[\t0]", "") replaces all \t and 0 characters in the input line with the empty string, i.e., effectively removes them, leaving only the (directly concatenated) 1 instances, if any.
length($0)==5 { print l } then prints the original input line (l) only if the length of the resulting string of 1s (now stored in the modified input line, $0), i.e., the count of 1s, matches the specified count.
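
To make the three steps above visible, here is a small sketch that prints the computed count in front of each original line instead of filtering. The helper name show_ones_count and the sample input are assumptions for illustration only.

```shell
# Hypothetical helper: prefix each input line with the number of characters left
# after all tabs and 0s have been removed (i.e., the number of 1s).
show_ones_count() {
  awk '{ l = $0; gsub("[\t0]", ""); print length($0) ": " l }'
}

# Illustrative tab-separated input: two 1s on the first line, none on the second.
printf '1\t0\t0\t1\n0\t0\t0\t0\n' | show_ones_count
```

Note that this approach counts whatever remains after stripping tabs and 0s, so it is only reliable when the lines contain nothing but tabs, 0s and 1s; any other character would inflate the count.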

anishsane · Answer 2 · 2016-05-26T03:37:10.563

2

You can use grep. But that would be an abuse of regex.

$ cat countme
1   0   0   1   0   1   1   0   1
1   1   0   1   0   1   0   1   1
1   1   1   1   1   1   1   1   1
0   0   0   0   0   0   0   0   0

$ grep -P '^[0\t]*(1[0\t]*){5}[0\t]*$' countme # Match exactly 5
1   0   0   1   0   1   1   0   1

$ grep -P '^[0\t]*(1[0\t]*){5,}[0\t]*$' countme # Match >=5
1   0   0   1   0   1   1   0   1
1   1   0   1   0   1   0   1   1
1   1   1   1   1   1   1   1   1

edited May 26 '16 at 03:37

answered May 26 '16 at 03:32

should be countme? – once May 26 '16 at 03:35

John Bollinger · Answer 3 · 2016-05-26T03:47:13.777

2

You can do this to get lines with exactly five '1's:

grep '^[^1]*\(1[^1]*\)\{5,5\}[^1]*$'

You can simplify that to this for at least five '1's:

grep '\(1[^1]*\)\{5,\}'

The enumerated quantifier (\{n,m\}) enables you to conveniently specify a particular number or range of numbers of consecutive matches to a sub-pattern. To avoid matching lines with extra matches to such a pattern, however, you must also anchor it to the beginning and end of the line.

The other trick is to make sure the gaps before the first 1, between the 1s, and after the last 1 are matched. In your case, all of those gaps can be represented pretty simply as runs of zero or more characters other than 1: [^1]*. Putting those pieces together gives you the above regular expressions.
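
The effect of anchoring described above can be checked with a quick sketch. The sample function and its two input lines are made up for illustration; grep -c prints the number of matching lines.

```shell
# Illustrative input: one line with exactly five 1s, one line with six.
sample() { printf '1\t0\t0\t1\t0\t1\t1\t0\t1\n1\t1\t0\t1\t0\t1\t0\t1\t1\n'; }

# Unanchored: any line containing at least five 1s matches, so both lines count.
sample | grep -c '\(1[^1]*\)\{5\}'          # prints 2

# Anchored at both ends: only the line with exactly five 1s matches.
sample | grep -c '^[^1]*\(1[^1]*\)\{5\}$'   # prints 1
```

Both patterns count 1 characters rather than fields, so they assume every field is a single 0 or 1.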

edited May 26 '16 at 03:47

answered May 26 '16 at 03:36

"Exactly five" is just `\{5\}`, no need for `\{5,5\}`, is there? – Benjamin W. May 26 '16 at 03:46

@BenjaminW., both forms work. I used `\{5,5\}` because it is more illustrative of enumerated quantifiers in general. Indeed, I could have used just `\{5\}` in the second example, too, because that pattern is not anchored. – John Bollinger May 26 '16 at 03:50

score 1 · Answer 4 · answered May 26 '16 at 05:03

Do

sed -nE '/^([^1]*1[^1]*){5}$/p' your_file

for exactly 5 matches and

sed -nE '/^([^1]*1[^1]*){5,}$/p' your_file

for 5 or more matches.

Note: Older GNU sed man pages may not list the -E option, but it is supported. Using -E keeps the command portable to, say, macOS.

score 1 · Answer 5 · answered Oct 08 '17 at 08:18

With perl:

$ perl -ane 'print if (grep {$_==1} @F) == 5' ip.txt 
1   0   0   1   0   1   1   0   1

$ perl -ane 'print if (grep {$_==1} @F) >= 5' ip.txt 
1   0   0   1   0   1   1   0   1
1   1   0   1   0   1   0   1   1
1   1   1   1   1   1   1   1   1

-a automatically splits each input line on whitespace and saves the fields to the @F array
grep {$_==1} @F returns the elements of @F that are numerically equal to 1
(grep {$_==1} @F) == 5: in scalar context, grep returns the number of matching elements, which is what the comparison uses
See http://perldoc.perl.org/perlrun.html#Command-Switches for details on the -a, -n and -e options
