Count the number of occurrences of a string using sed?

Question

I have a file which contains "title" written in it many times. How can I find the number of times "title" is written in that file using the sed command provided that "title" is the first string in a line? e.g.

# title
title
title

should output a count of 2, because in the first line "title" is not the first string.

Update

I used awk to find the total number of occurrences as:

awk '$1 ~ /title/ {++c} END {print c}' FS=: myFile.txt

But how can I tell awk to count only those lines having title the first string as explained in example above?

Search pattern /^title/ looks for title at the start of a line. Using `grep -c '^title'` is simpler than awk, and probably faster too. — Jonathan Leffler, Nov 23 '09 at 06:16
Change your regular expression to match the beginning of the line: /^title/ — R Samuel Klatchko, Nov 23 '09 at 06:18
Thanks a lot Jonathan. Grep surely is simpler. Thanks to you too Samuel. — Uthman, Nov 23 '09 at 06:29
The key, by the way, to counting only lines that start with "title" is to use a regex anchor: `/^title/` where `^` means the beginning of the line. Similarly, `$` means the end of the line as in `/book$/` which matches lines that end in "book". If you want to match a whole line, you can use both anchors: `/^title of the book$/` which would _not_ match "The title of the book is War and Peace." — Dennis Williamson, Feb 01 '19 at 17:49
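
As a minimal sketch of the anchored pattern discussed in these comments, the snippet below runs `grep -c` and an equivalent awk one-liner on the three-line sample from the question. The variable names and the use of `printf` to build the sample are illustrative assumptions, not part of the original post.

```shell
# Sample input from the question: only the last two lines start with "title".
sample=$(printf '%s\n' '# title' 'title' 'title')

# grep -c counts the lines that match; ^ anchors the match to the line start.
grep_count=$(printf '%s\n' "$sample" | grep -c '^title')

# The same anchor in awk; c+0 forces a numeric 0 when nothing matches.
awk_count=$(printf '%s\n' "$sample" | awk '/^title/ {++c} END {print c+0}')

echo "grep: $grep_count, awk: $awk_count"
```

Both counters report 2 for this input, since the `# title` line does not begin with "title".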

Dennis Williamson · Answer 1 · 2017-12-07T18:58:59.977

Never say never. Pure sed (although it may require the GNU version).

#!/bin/sed -nf
# based on a script from the sed info file (info sed)
# section 4.8 Numbering Non-blank Lines (cat -b)
# modified to count lines that begin with "title"

/^title/! be

x
/^$/ s/^.*$/0/
/^9*$/ s/^/0/
s/.9*$/x&/
h
s/^.*x//
y/0123456789/1234567890/
x
s/x.*$//
G
s/\n//
h

:e

$ {x;p}

Explanation:

#!/bin/sed -nf
# run sed without printing output by default (-n)
# using the following file as the sed script (-f)

/^title/! be        # if the current line doesn't begin with "title" branch to label e

x                   # swap the counter from hold space into pattern space
/^$/ s/^.*$/0/      # if pattern space is empty start the counter at zero
/^9*$/ s/^/0/       # if pattern space consists only of nines, prepend a zero
s/.9*$/x&/          # mark the position of the last digit before a sequence of nines (if any)
h                   # copy the marked counter to hold space
s/^.*x//            # delete everything before the marker
y/0123456789/1234567890/   # increment the digits that were after the mark
x                   # swap pattern space and hold space
s/x.*$//            # delete everything after the marker leaving the leading digits
G                   # append hold space to pattern space
s/\n//              # remove the newline, leaving all the digits concatenated
h                   # save the counter into hold space

:e                  # label e

$ {x;p}             # if this is the last line of input, swap in the counter and print it
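
To see the increment logic in isolation, the sketch below applies only the counter-bumping commands from the explanation to a single number. The `increment` function is a hypothetical helper introduced for illustration; it is not part of the original script, which keeps the counter in hold space across input lines.

```shell
# Hypothetical helper: applies only the increment steps of the script
# to the decimal number given as its first argument.
increment() {
  printf '%s\n' "$1" | sed \
    -e '/^9*$/ s/^/0/' \
    -e 's/.9*$/x&/' \
    -e 'h' \
    -e 's/^.*x//' \
    -e 'y/0123456789/1234567890/' \
    -e 'x' \
    -e 's/x.*$//' \
    -e 'G' \
    -e 's/\n//'
}

increment 10     # marker lands before the 0: 1x0, giving 11
increment 1289   # the trailing nine rolls over with the 8: 12x89, giving 1290
increment 99     # all nines: a zero is prepended first, giving 100
```

The `y` command maps each digit to its successor (9 wraps to 0), which is why the marker has to sit before the last non-nine digit: that digit and every nine after it are the only ones that change.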

Here are excerpts from a trace of the script using sedsed:

$ echo -e 'title\ntitle\nfoo\ntitle\nbar\ntitle\ntitle\ntitle\ntitle\ntitle\ntitle\ntitle\ntitle' | sedsed-1.0 -d -f ./counter 
PATT:title$
HOLD:$
COMM:/^title/ !b e
COMM:x
PATT:$
HOLD:title$
COMM:/^$/ s/^.*$/0/
PATT:0$
HOLD:title$
COMM:/^9*$/ s/^/0/
PATT:0$
HOLD:title$
COMM:s/.9*$/x&/
PATT:x0$
HOLD:title$
COMM:h
PATT:x0$
HOLD:x0$
COMM:s/^.*x//
PATT:0$
HOLD:x0$
COMM:y/0123456789/1234567890/
PATT:1$
HOLD:x0$
COMM:x
PATT:x0$
HOLD:1$
COMM:s/x.*$//
PATT:$
HOLD:1$
COMM:G
PATT:\n1$
HOLD:1$
COMM:s/\n//
PATT:1$
HOLD:1$
COMM:h
PATT:1$
HOLD:1$
COMM::e
COMM:$ {
PATT:1$
HOLD:1$
PATT:title$
HOLD:1$
COMM:/^title/ !b e
COMM:x
PATT:1$
HOLD:title$
COMM:/^$/ s/^.*$/0/
PATT:1$
HOLD:title$
COMM:/^9*$/ s/^/0/
PATT:1$
HOLD:title$
COMM:s/.9*$/x&/
PATT:x1$
HOLD:title$
COMM:h
PATT:x1$
HOLD:x1$
COMM:s/^.*x//
PATT:1$
HOLD:x1$
COMM:y/0123456789/1234567890/
PATT:2$
HOLD:x1$
COMM:x
PATT:x1$
HOLD:2$
COMM:s/x.*$//
PATT:$
HOLD:2$
COMM:G
PATT:\n2$
HOLD:2$
COMM:s/\n//
PATT:2$
HOLD:2$
COMM:h
PATT:2$
HOLD:2$
COMM::e
COMM:$ {
PATT:2$
HOLD:2$
PATT:foo$
HOLD:2$
COMM:/^title/ !b e
COMM:$ {
PATT:foo$
HOLD:2$
. . .
PATT:10$
HOLD:10$
PATT:title$
HOLD:10$
COMM:/^title/ !b e
COMM:x
PATT:10$
HOLD:title$
COMM:/^$/ s/^.*$/0/
PATT:10$
HOLD:title$ 
COMM:/^9*$/ s/^/0/
PATT:10$
HOLD:title$
COMM:s/.9*$/x&/
PATT:1x0$
HOLD:title$
COMM:h
PATT:1x0$
HOLD:1x0$
COMM:s/^.*x//
PATT:0$
HOLD:1x0$
COMM:y/0123456789/1234567890/
PATT:1$
HOLD:1x0$
COMM:x
PATT:1x0$
HOLD:1$
COMM:s/x.*$//
PATT:1$
HOLD:1$
COMM:G
PATT:1\n1$
HOLD:1$
COMM:s/\n//
PATT:11$
HOLD:1$
COMM:h
PATT:11$
HOLD:11$
COMM::e
COMM:$ {
COMM:x
PATT:11$
HOLD:11$
COMM:p
11
PATT:11$
HOLD:11$
COMM:}
PATT:11$
HOLD:11$

The ellipsis represents lines of output I omitted here. The line with "11" on it by itself is where the final count is output. That's the only output you'd get when the sedsed debugger isn't being used.
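
For readers without sedsed, a sketch of running the same counter directly is shown below. It writes the script to a temporary file and passes it with `sed -n -f`; the temporary file, the `mktemp` call, and the one-command-per-line layout of the final block are assumptions made here for portability and are not taken from the answer.

```shell
# Write the counter script to a temporary file (the path is illustrative).
script=$(mktemp)
cat > "$script" <<'EOF'
/^title/!b e
x
/^$/ s/^.*$/0/
/^9*$/ s/^/0/
s/.9*$/x&/
h
s/^.*x//
y/0123456789/1234567890/
x
s/x.*$//
G
s/\n//
h
:e
${
x
p
}
EOF

# Same 13-line input as the sedsed trace: 11 lines begin with "title".
count=$(printf '%s\n' title title foo title bar title title title title \
  title title title title | sed -n -f "$script")
echo "$count"
rm -f "$script"
```

This prints 11, matching the final count in the trace.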

Just say No: Despite prodigious technical acumen, this is a fine example of programming that is "too clever". `sed` is not the right tool for this job; the Revised answer below should be the accepted answer. — EdwardG, Mar 25 '22 at 13:24
@EdwardG: I agree that doing this is absurd, but it answers the question as originally asked. It's a perfect example of "Hold my beer!" The OP accepted [another answer](https://stackoverflow.com/a/1781353/26428) (not mine) shortly after it was posted. In the intervening time, visitors to this question upvoted each of the answers by different amounts. Who knows precisely why mine received ever so slightly more? — Dennis Williamson, Jun 23 '22 at 23:27
Another option would be to use `rg`, which is faster and can count. The solution is absurd, as you point out, and writing clean and concise code should be the main concern. Do you know how long that answer took to formulate? I will tell you: Too long. — EdwardG, Jun 25 '22 at 18:15
"Too long" is correct regardless of the following qualifiers: It was based on an existing script, there was a time when I had these techniques well-practiced and I love a challenge. — Dennis Williamson, Jun 26 '22 at 15:12

score 16 · Answer 2 · answered Nov 23 '09 at 06:05

Revised answer

Succinctly, you can't - sed is not the correct tool for the job (it cannot count).

sed -n '/^title/p' file | grep -c '^'

This looks for lines starting title and prints them, feeding the output into grep to count them. Or, equivalently:

grep -c '^title' file

Original answer - before the question was edited

Succinctly, you can't - it is not the correct tool for the job.

grep -c title file

sed -n /title/p file | wc -l

The second uses sed as a surrogate for grep and sends the output to 'wc' to count lines. Both count the number of lines containing 'title', rather than the number of occurrences of title. You could fix that with something like:

cat file |
tr ' ' '\n' |
grep -c title

The 'tr' command converts blanks into newlines, thus putting each space separated word on its own line, and therefore grep only gets to count lines containing the word title. That works unless you have sequences such as 'title-entitlement' where there's no space separating the two occurrences of title.
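
For the 'title-entitlement' case, one option is `grep -o`, which prints each match on its own line so that `wc -l` can count every occurrence. This is a sketch that assumes a grep supporting `-o` (GNU and BSD grep do; POSIX does not require it), and the sample text is made up for illustration.

```shell
# Made-up sample: the first line has two occurrences with no space between them.
sample=$(printf '%s\n' 'title-entitlement' 'title title')

# tr + grep -c counts words containing "title": 3 here.
by_word=$(printf '%s\n' "$sample" | tr ' ' '\n' | grep -c title)

# grep -o emits one line per match, so wc -l counts every occurrence: 4 here.
by_match=$(printf '%s\n' "$sample" | grep -o title | wc -l)

echo "words: $by_word, occurrences: $((by_match))"
```

The arithmetic expansion around `by_match` only strips the padding that some `wc` implementations add.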

pavium · Accepted Answer · 2009-11-23T06:14:48.030

12

I don't think sed would be appropriate, unless you use it in a pipeline to convert your file so that the word you need appears on separate lines, and then use grep -c to count the occurrences.

I like Jonathan's idea of using tr to convert spaces to newlines. The beauty of this method is that successive spaces get converted to multiple blank lines but it doesn't matter because grep will be able to count just the lines with the single word 'title'.

edited Nov 23 '09 at 06:14

answered Nov 23 '09 at 06:04

pavium

Beat me by 15 seconds - drat. I should be less verbose. – Jonathan Leffler Nov 23 '09 at 06:06
I think I should leave sed, then. Maybe awk will do the magic for me. Please see the updated question. – Uthman Nov 23 '09 at 06:14

score 5 · Answer 4 · answered Nov 23 '09 at 10:55

sed 's/title/title\n/g' file | grep -c title

ghostdog74

That's essentially the same as the first part of **Jonathan Leffler's** answer. – Dennis Williamson Nov 23 '09 at 10:57
yes, looks similar, but not quite. different way of doing it in sed. – ghostdog74 Nov 23 '09 at 12:15

ghostdog74 · Answer 5 · 2009-11-23T06:23:26.230

4

Just one gawk command will do. Don't use grep -c, because it only counts lines with "title" in them, regardless of how many "title"s there are on each line.

$ more file
#         title
#  title
one
two
#title
title title
three
title junk title
title
four
fivetitlesixtitle
last

$ awk '!/^#.*title/{m=gsub("title","");total+=m}END{print "total: "total}' file
total: 7

If you just want "title" as the first string, use "==" instead of "~":

awk '$1 == "title"{++c}END{print c}' file
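
The field comparison and the anchored regex from the question's comments are not quite the same test. The sketch below, on a made-up sample, shows where they disagree; the variable names and the `c+0` idiom (to print 0 instead of an empty line when nothing matches) are additions made here.

```shell
# Made-up sample to show where the two tests disagree.
sample=$(printf '%s\n' '# title' 'title' '  title indented' 'titles and more' 'titlebar')

# Field comparison: the first whitespace-separated field must equal "title",
# so the indented line counts but "titles" and "titlebar" do not.
by_field=$(printf '%s\n' "$sample" | awk '$1 == "title" {++c} END {print c+0}')

# Anchored regex: the line must start with the characters "title",
# so "titles" and "titlebar" count but the indented line does not.
by_anchor=$(printf '%s\n' "$sample" | awk '/^title/ {++c} END {print c+0}')

echo "first field: $by_field, anchored: $by_anchor"
```

Which one is right depends on whether "first string" means the first word on the line or the first characters of the line.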

edited Nov 23 '09 at 06:23

answered Nov 23 '09 at 06:12

ghostdog74

How can I count just those lines that have title as the first string? That would make total: 3 in your example. – Uthman Nov 23 '09 at 06:16
Since the question got changed (probably while you were answering), there is no longer a need to count the number of occurrences of title anywhere in a line - only those at the start of the line count. – Jonathan Leffler Nov 23 '09 at 06:18
@Jonathan, it doesn't matter. This method does it all. If the requirement changes to counting "title" everywhere, there is minimal change to the code. – ghostdog74 Nov 23 '09 at 06:20

score 3 · Answer 6 · answered Dec 13 '11 at 00:45

This might work for you:

sed '/^title/!d' file | sed -n '$='
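
In this pipeline the first sed deletes every line that does not start with "title", and the second prints the line number of its last input line (`$=`), which equals the number of surviving lines. A sketch on the question's sample follows. When nothing matches, the second sed receives no input and prints nothing rather than 0. The variable names are illustrative.

```shell
sample=$(printf '%s\n' '# title' 'title' 'title')

# Two lines survive the first sed, so $= prints 2.
count=$(printf '%s\n' "$sample" | sed '/^title/!d' | sed -n '$=')

# No line starts with "title": the result is an empty string, not 0.
none=$(printf '%s\n' 'foo' 'bar' | sed '/^title/!d' | sed -n '$=')

echo "count: $count, none: '${none}'"
```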

potong
