3

How to count the number of integers in a file using egrep?

I tried to solve this as a pattern-matching problem. My difficulty is how to represent a continuous run of characters in the range [0-9] that has a "space" before its beginning and a "space or dot" after its end. I think the latter can be handled by using \< and \> respectively. Also, the run should not contain a dot in between, otherwise it is not an integer. I have been unable to convert this logic into a regular expression with the tools and techniques I know.

My name is 2322.
33 is my sister.
I am blessed with a son named 55.
Why are you so 69. Is everything 33.
66.88 is not an integer
55whereareyou?

The right answer should be 5, i.e. one each for 2322, 33, 55, 69 and 33.

Cyrus
  • 84,225
  • 14
  • 89
  • 153
Delsilon
  • 156
  • 12
  • `` [0-9][ .]``? Isn't it easy enough? – user202729 Jan 20 '18 at 16:53
  • Can you explain the [ .] part a little bit? I didn't understand why you used "double backticks" instead of a single tick. Also, this expression still reports 66.88 as an integer. – Delsilon Jan 20 '18 at 17:00
  • Please add sample input and your desired output for that sample input to your question. – Cyrus Jan 20 '18 at 17:02
  • The double backticks are just a formatting error. We use backticks here to format `code` but @user202729 apparently made a typo. (The proposed regex doesn't exclude floating-point numbers so if that's what you are asking, it doesn't work.) – tripleee Jan 22 '18 at 04:25
  • @tripleee It's not possible to put space at the beginning of code. – user202729 Jan 22 '18 at 04:33
  • ` huh`? Today I learned. The easy workaround is to put something in front of the remark, but that's a bug alright. (Why would you want a space at the beginning anyway, though?) – tripleee Jan 22 '18 at 05:37
  • For the record, a workaround for @user202729 's problem is here: https://meta.stackoverflow.com/questions/297113/how-to-insert-a-space-as-a-first-character-inside-of-backticks-in-comments – tripleee Jan 22 '18 at 05:44

3 Answers

4
                    grep -Eo '(^| )([0-9]+[\.\?\=\:]?( |$))+' | wc -w
                          ^^    ^     ^       ^        ^   ^     ^
                          ||    |     |       |        |   |     |
E = extended regex--------+|    |     |       |        |   |     |
o = extract what found-----+    |     |       |        |   |     |
starts with new line or space---+     |       |        |   |     |
digits--------------------------------+       |        |   |     |
optional dot, question mark, etc.-------------+        |   |     |
ends with end line or space----------------------------+   |     |
repeat 1 time or more (to detect integers like "123 456")--+     |
count words------------------------------------------------------+

Note: 123. 123? 123: are also counted as integers.

Test:

#!/bin/bash

exec 3<<EOF
My name is 2322.
33 is my sister.
I am blessed with a son named 55.
Why are you so 69. Is everything 33.
66.88 is not an integer
55whereareyou?
two integers 123 456.
how many tables in room 400? 50.
50? oh I thought it was 40.
23: It's late, 23:00 already
EOF

grep -Eo '(^| )([0-9]+[\.\?\=\:]?( |$))+' <&3 | \
  tee >(sleep 0.5; echo -n "integer counted: "; wc -w; )

Outputs:

 2322.
33 
 55.
 69. 
 33.
 123 456.
 400? 50.
50? 
 40.
23: 
integer counted: 12
Bach Lien
  • 1,030
  • 6
  • 7
  • Thanks for your answer, but I am new to Linux and still there are things which are not making sense to me. I think I should have tried more rigorously before posting it here. Maybe in a day or two I would be able to solve this. Btw thanks a lot again! – Delsilon Jan 20 '18 at 17:47
  • There is nothing here which dictates use of the `-P` option, which isn't standard or portable anyway. For this relatively simple regex, `-E` will work equally well. (Perhaps one day Perl regexes will be available everywhere; but then the command will probably not be called `grep`.) – tripleee Jan 20 '18 at 18:00
1

Based on the observation that you want 66.88 excluded, I'm guessing

grep -Ec '[0-9]\.?( |$)' file

which finds a digit, optionally followed by a dot, followed by either a space or end of line.

The -c option reports the number of lines which contain a match (so not strictly the number of matches, if some lines contain multiple matches), and the -E option enables extended regular expression syntax, i.e. what was traditionally called egrep (though that command name is now obsolescent).

If you need to count matches, the -o option prints each match on a separate line, which you can then pass to wc -l (or in lucky cases combine with grep -c, but check first; this doesn't work e.g. with GNU grep currently).
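
To make the difference between counting lines and counting matches concrete, here is a minimal sketch that runs the pattern above on one line from the question that contains two integers. The variable names are illustrative only, -o is assumed to be available (GNU and BSD grep have it), and `tr -d ' '` is there merely because some wc implementations pad their output with spaces.

```shell
#!/bin/sh
# One sample line from the question; it contains two integers (69 and 33).
line='Why are you so 69. Is everything 33.'

# -c counts matching lines, so this line is reported once however many matches it has.
lines=$(printf '%s\n' "$line" | grep -Ec '[0-9]\.?( |$)')

# -o prints each match on its own output line; wc -l then counts the matches.
matches=$(printf '%s\n' "$line" | grep -Eo '[0-9]\.?( |$)' | wc -l | tr -d ' ')

echo "matching lines: $lines"
echo "matches: $matches"
```

This should report 1 matching line and 2 matches. Note that the pattern only looks at what follows a digit, so the trailing 88 in "66.88 is" would also match; anchoring the start of the number as well would be needed to exclude it.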

tripleee
  • 175,061
  • 34
  • 275
  • 318
0

On my Ubuntu machine this command works fine:

grep -P '((^)|(\s+))[-+]?\d+\.?((\s+)|($))' test
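
As a hedged sketch, the same idea can be written with -E instead of -P and combined with -o to get a count rather than the matching lines. `count_integers` is a hypothetical helper name, the POSIX forms `[[:space:]]` and `[0-9]` stand in for `\s` and `\d`, and -o is assumed to be available, as it is in GNU and BSD grep.

```shell
#!/bin/sh
# Hypothetical helper: print how many integer-like tokens the file "$1" contains.
count_integers() {
    # An optional sign and a run of digits, optionally followed by one dot,
    # delimited by whitespace or the line boundaries.
    # -o emits one match per output line, so wc -l counts matches, not lines.
    grep -Eo '(^|[[:space:]])[-+]?[0-9]+\.?([[:space:]]|$)' "$1" | wc -l | tr -d ' '
}
```

Calling `count_integers test` on the sample input from the question should print 5. Two integers separated by a single space may be undercounted, because both matches would need to claim that same space.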
Victor Dronov
  • 139
  • 1
  • 12