21

I have a file which contains "title" written in it many times. How can I find the number of times "title" is written in that file using the sed command provided that "title" is the first string in a line? e.g.

# title
title
title

should output a count of 2, because in the first line "title" is not the first string.

Update

I used awk to find the total number of occurrences as:

awk '$1 ~ /title/ {++c} END {print c}' FS=: myFile.txt

But how can I tell awk to count only those lines having title the first string as explained in example above?

Jonathan Leffler
Uthman
  • Search pattern /^title/ looks for title at the start of a line. Using `grep -c '^title'` is simpler than awk, and probably faster too. – Jonathan Leffler Nov 23 '09 at 06:16
  • Change your regular expression to match the beginning of the line: /^title/ – R Samuel Klatchko Nov 23 '09 at 06:18
  • Thanks a lot Jonathan. Grep surely is simpler. Thanks to you too Samuel. – Uthman Nov 23 '09 at 06:29
  • The key, by the way, to counting only lines that start with "title" is to use a regex anchor: `/^title/` where `^` means the beginning of the line. Similarly, `$` means the end of the line as in `/book$/` which matches lines that end in "book". If you want to match a whole line, you can use both anchors: `/^title of the book$/` which would _not_ match "The title of the book is War and Peace." – Dennis Williamson Feb 01 '19 at 17:49
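The anchors described in the comments above can be tried out with a small sketch. The sample lines and the variable names are made up for illustration; only the three patterns come from the comments.

```shell
# Hypothetical sample data, written to a temporary file.
sample=$(mktemp)
printf '%s\n' \
  'title of the book' \
  'The title of the book is War and Peace.' \
  'my favorite book' > "$sample"

starts=$(grep -c '^title' "$sample")             # lines that begin with "title"
ends=$(grep -c 'book$' "$sample")                # lines that end with "book"
whole=$(grep -c '^title of the book$' "$sample") # lines that are exactly this text

echo "start: $starts, end: $ends, whole line: $whole"
rm -f "$sample"
```

With this sample data, only the first line satisfies both anchors at once; the "War and Peace" sentence matches none of the three patterns.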

6 Answers

23

Never say never. Pure sed (although it may require the GNU version).

#!/bin/sed -nf
# based on a script from the sed info file (info sed)
# section 4.8 Numbering Non-blank Lines (cat -b)
# modified to count lines that begin with "title"

/^title/! be

x
/^$/ s/^.*$/0/
/^9*$/ s/^/0/
s/.9*$/x&/
h
s/^.*x//
y/0123456789/1234567890/
x
s/x.*$//
G
s/\n//
h

:e

$ {x;p}

Explanation:

#!/bin/sed -nf
# run sed without printing output by default (-n)
# using the following file as the sed script (-f)

/^title/! be        # if the current line doesn't begin with "title" branch to label e

x                   # swap the counter from hold space into pattern space
/^$/ s/^.*$/0/      # if pattern space is empty start the counter at zero
/^9*$/ s/^/0/       # if pattern space consists only of nines, prepend a zero
s/.9*$/x&/          # mark the position of the last digit before a sequence of nines (if any)
h                   # copy the marked counter to hold space
s/^.*x//            # delete everything before the marker
y/0123456789/1234567890/   # increment the digits that were after the mark
x                   # swap pattern space and hold space
s/x.*$//            # delete everything after the marker leaving the leading digits
G                   # append hold space to pattern space
s/\n//              # remove the newline, leaving all the digits concatenated
h                   # save the counter into hold space

:e                  # label e

$ {x;p}             # if this is the last line of input, swap in the counter and print it

Here are excerpts from a trace of the script using sedsed:

$ echo -e 'title\ntitle\nfoo\ntitle\nbar\ntitle\ntitle\ntitle\ntitle\ntitle\ntitle\ntitle\ntitle' | sedsed-1.0 -d -f ./counter 
PATT:title$
HOLD:$
COMM:/^title/ !b e
COMM:x
PATT:$
HOLD:title$
COMM:/^$/ s/^.*$/0/
PATT:0$
HOLD:title$
COMM:/^9*$/ s/^/0/
PATT:0$
HOLD:title$
COMM:s/.9*$/x&/
PATT:x0$
HOLD:title$
COMM:h
PATT:x0$
HOLD:x0$
COMM:s/^.*x//
PATT:0$
HOLD:x0$
COMM:y/0123456789/1234567890/
PATT:1$
HOLD:x0$
COMM:x
PATT:x0$
HOLD:1$
COMM:s/x.*$//
PATT:$
HOLD:1$
COMM:G
PATT:\n1$
HOLD:1$
COMM:s/\n//
PATT:1$
HOLD:1$
COMM:h
PATT:1$
HOLD:1$
COMM::e
COMM:$ {
PATT:1$
HOLD:1$
PATT:title$
HOLD:1$
COMM:/^title/ !b e
COMM:x
PATT:1$
HOLD:title$
COMM:/^$/ s/^.*$/0/
PATT:1$
HOLD:title$
COMM:/^9*$/ s/^/0/
PATT:1$
HOLD:title$
COMM:s/.9*$/x&/
PATT:x1$
HOLD:title$
COMM:h
PATT:x1$
HOLD:x1$
COMM:s/^.*x//
PATT:1$
HOLD:x1$
COMM:y/0123456789/1234567890/
PATT:2$
HOLD:x1$
COMM:x
PATT:x1$
HOLD:2$
COMM:s/x.*$//
PATT:$
HOLD:2$
COMM:G
PATT:\n2$
HOLD:2$
COMM:s/\n//
PATT:2$
HOLD:2$
COMM:h
PATT:2$
HOLD:2$
COMM::e
COMM:$ {
PATT:2$
HOLD:2$
PATT:foo$
HOLD:2$
COMM:/^title/ !b e
COMM:$ {
PATT:foo$
HOLD:2$
. . .
PATT:10$
HOLD:10$
PATT:title$
HOLD:10$
COMM:/^title/ !b e
COMM:x
PATT:10$
HOLD:title$
COMM:/^$/ s/^.*$/0/
PATT:10$
HOLD:title$ 
COMM:/^9*$/ s/^/0/
PATT:10$
HOLD:title$
COMM:s/.9*$/x&/
PATT:1x0$
HOLD:title$
COMM:h
PATT:1x0$
HOLD:1x0$
COMM:s/^.*x//
PATT:0$
HOLD:1x0$
COMM:y/0123456789/1234567890/
PATT:1$
HOLD:1x0$
COMM:x
PATT:1x0$
HOLD:1$
COMM:s/x.*$//
PATT:1$
HOLD:1$
COMM:G
PATT:1\n1$
HOLD:1$
COMM:s/\n//
PATT:11$
HOLD:1$
COMM:h
PATT:11$
HOLD:11$
COMM::e
COMM:$ {
COMM:x
PATT:11$
HOLD:11$
COMM:p
11
PATT:11$
HOLD:11$
COMM:}
PATT:11$
HOLD:11$

The ellipsis represents lines of output I omitted here. The line with "11" on it by itself is where the final count is output. That's the only output you'd get when the sedsed debugger isn't being used.
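To experiment with the increment step on its own, the counter-handling commands from the script can be isolated in a shell function. This is only a sketch: the function name `increment` is hypothetical, and it feeds the number in on standard input instead of keeping it in hold space between lines.

```shell
# Hypothetical helper: add 1 to a non-negative decimal number using only sed.
increment() {
  printf '%s\n' "$1" | sed '
    /^9*$/ s/^/0/
    s/.9*$/x&/
    h
    s/^.*x//
    y/0123456789/1234567890/
    x
    s/x.*$//
    G
    s/\n//
  '
}

increment 0     # prints 1
increment 129   # prints 130
increment 199   # prints 200
increment 99    # prints 100
```

The carry works because `y` turns every trailing 9 into 0 and bumps the single digit in front of them, while the untouched leading digits are glued back on afterwards.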

Dennis Williamson
  • I would love an explanation of this. – Thomas G Henry LLC Dec 07 '17 at 16:12
  • @ThomasGHenry: I added an explanation – Dennis Williamson Dec 07 '17 at 18:59
  • Just say No: Despite prodigious technical acumen, this is a fine example of programming that is "too clever". `sed` is not the right tool for this job; the Revised answer below should be the accepted answer. – EdwardG Mar 25 '22 at 13:24
  • @EdwardG: I agree that doing this is absurd, but it answers the question as originally asked. It's a perfect example of "Hold my beer!" The OP accepted [another answer](https://stackoverflow.com/a/1781353/26428) (not mine) shortly after it was posted. In the intervening time, visitors to this question upvoted each of the answers by different amounts. Who knows precisely why mine received ever so slightly more? – Dennis Williamson Jun 23 '22 at 23:27
  • Another option would be to use `rg`, which is faster and can count. The solution is absurd, as you point out, and writing clean and concise code should be the main concern. Do you know how long that answer took to formulate? I will tell you: too long. – EdwardG Jun 25 '22 at 18:15
  • "Too long" is correct regardless of the following qualifiers: it was based on an existing script, there was a time when I had these techniques well-practiced, and I love a challenge. – Dennis Williamson Jun 26 '22 at 15:12
16

Revised answer

Succinctly, you can't - sed is not the correct tool for the job (it cannot count).

sed -n '/^title/p' file | grep -c ''

This looks for lines starting with title and prints them, feeding the output into grep to count them (the empty pattern matches every line grep receives). Or, equivalently:

grep -c '^title' file
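As a quick check, both forms can be run against the three-line sample from the question. The temporary file and variable names here are illustrative, and the sed pipeline uses an empty grep pattern as a plain line counter.

```shell
# Sample input taken from the question.
sample=$(mktemp)
printf '%s\n' '# title' 'title' 'title' > "$sample"

# sed selects the matching lines; grep -c '' counts every line it is given.
via_sed=$(sed -n '/^title/p' "$sample" | grep -c '')

# grep does the selecting and the counting in one step.
via_grep=$(grep -c '^title' "$sample")

echo "sed pipeline: $via_sed, grep alone: $via_grep"
rm -f "$sample"
```

Both report 2 here, since the `# title` line does not begin with title.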

Original answer - before the question was edited

Succinctly, you can't - it is not the correct tool for the job.

grep -c title file

sed -n /title/p file | wc -l

The second uses sed as a surrogate for grep and sends the output to 'wc' to count lines. Both count the number of lines containing 'title', rather than the number of occurrences of title. You could fix that with something like:

tr ' ' '\n' < file |
grep -c title

The 'tr' command converts blanks into newlines, thus putting each space separated word on its own line, and therefore grep only gets to count lines containing the word title. That works unless you have sequences such as 'title-entitlement' where there's no space separating the two occurrences of title.
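The limitation can be seen with a two-line sample. The sketch below also tries `grep -o`, which prints each match on its own line; that option is an assumption about the installed grep (GNU and BSD grep provide it, but it is not in POSIX), and the file contents are invented for illustration.

```shell
# Hypothetical sample with four occurrences of "title" on two lines.
sample=$(mktemp)
printf '%s\n' 'title title' 'title-entitlement' > "$sample"

# Split on spaces, then count lines containing "title".
# "title-entitlement" stays on one line, so it is counted once.
by_word=$(tr ' ' '\n' < "$sample" | grep -c title)

# Print every match on its own line, then count the lines.
# tr -d ' ' strips the padding some wc implementations add.
by_match=$(grep -o title "$sample" | wc -l | tr -d ' ')

echo "tr + grep -c: $by_word, grep -o + wc -l: $by_match"
rm -f "$sample"
```

Here the tr pipeline reports 3 while the per-match count is 4.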

Jonathan Leffler
12

I don't think sed would be appropriate, unless you use it in a pipeline to convert your file so that the word you need appears on separate lines, and then use grep -c to count the occurrences.

I like Jonathan's idea of using tr to convert spaces to newlines. The beauty of this method is that successive spaces get converted to multiple blank lines but it doesn't matter because grep will be able to count just the lines with the single word 'title'.

pavium
5
sed 's/title/title\n/g' file | grep -c title
ghostdog74
4

Just one gawk command will do. Don't use grep -c, because it counts only the lines with "title" in them, regardless of how many times "title" appears in each line.

$ more file
#         title
#  title
one
two
#title
title title
three
title junk title
title
four
fivetitlesixtitle
last

$ awk '!/^#.*title/{m=gsub("title","");total+=m}END{print "total: "total}' file
total: 7

If you just want "title" as the first string, use "==" instead of "~":

awk '$1 == "title"{++c}END{print c}' file
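A field comparison and a line anchor are not quite the same test, which a small sketch can show. The sample extends the question's input with one indented line (an addition made here for illustration), and `c+0` is used so that a count of zero prints as 0 instead of an empty line.

```shell
# Question's sample plus one hypothetical line with leading spaces.
sample=$(mktemp)
printf '%s\n' '# title' 'title' 'title' '  title' > "$sample"

# Anchor: the line itself must begin with "title".
anchored=$(awk '/^title/ {c++} END {print c+0}' "$sample")

# Field test: the first whitespace-separated field must equal "title",
# so leading blanks are ignored.
first_field=$(awk '$1 == "title" {c++} END {print c+0}' "$sample")

echo "anchored: $anchored, first field: $first_field"
rm -f "$sample"
```

The anchored pattern gives 2 and the field comparison gives 3, because awk skips leading whitespace when splitting fields with the default separator.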
ghostdog74
  • How can I count just those lines having title as the first string? That will make total: 3 in your example – Uthman Nov 23 '09 at 06:16
  • Since the question got changed (probably while you were answering), there is no longer a need to count the number of occurrences of title anywhere in a line - only those at the start of the line count. – Jonathan Leffler Nov 23 '09 at 06:18
  • @Johnathan, it doesn't matter. This method does it all. If the requirement changes to count "title" everywhere, there is minimal change to the code. – ghostdog74 Nov 23 '09 at 06:20
3

This might work for you:

sed '/^title/!d' file | sed -n '$='
potong