Using the command line and regex to determine words that start sentences

Question

I have the text:

 This is a test. This is only a test! If there were an emergency, then Information would be provided for you.

I want to be able to determine which words start sentences. What I have now is:

 $ cat <FILE> | perl -pe 's/[\s.?!]/\n/g;'

This replaces every whitespace character and every sentence-ending punctuation mark with a newline, giving me:

 This
 is
 a
 test

 This
 is
 only
 a
 test

 If
 there
 were
 an
 emergency,
 then
 Information
 would
 be
 provided
 for
 you

From here I could somehow extract the words that have either nothing above them (start of file) or a blank line above them, but I am unsure exactly how to do this.

See [Regex to match first word in sentence](http://stackoverflow.com/questions/14767080/regex-to-match-first-word-in-sentence) — Håkon Hægland, Sep 14 '16 at 15:13
Not what I am looking for. When matching, it will include the punctuation. Also, I do not know how to extract the matches via grep. — basil, Sep 14 '16 at 15:18

score 6 · Accepted Answer · edited May 23 '17 at 12:01

If you have a Perl of at least version 5.22.1 (or 5.22.0 and this case is not affected by the bug described here), then you can use the sentence boundaries in your regular expression.

use feature 'say';

# Assumes the text is in $_; each capture is the first word of a sentence.
foreach my $word (m/\b{sb}(\w+)/g) {
    say $word;
}

Or, as a one-liner:

perl -nE 'say for /\b{sb}(\w+)/g'

If called with your example text, the output is:

This
This
If

It uses \b{sb}, which matches at a sentence boundary. brian d foy's blog has a tutorial about it. The \b{} construct is called a Unicode boundary and is described in perlrebackslash.
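
Since the approach depends on the Perl version, it may help to check the installed perl before relying on it. The following is a minimal sketch; the `perl_has_sb` helper name is an assumption, and the check only covers the 5.22.1 threshold named above, not the 5.22.0 case.

```shell
# Hypothetical helper: succeeds only if perl is at least 5.22.1,
# the version this answer names for \b{sb}.
perl_has_sb() {
  perl -e 'use 5.022001;' 2>/dev/null
}

if perl_has_sb; then
  printf '%s\n' 'This is a test. This is only a test! If there were an emergency, then Information would be provided for you.' |
    perl -nE 'say for /\b{sb}(\w+)/g'
else
  # printf is used instead of echo so the backslash is printed literally.
  printf '%s\n' 'perl is older than 5.22.1; \b{sb} is not available' >&2
fi
```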

answered Sep 14 '16 at 15:21

simbabque

Hmm, nice solution. I thought about `\p{punct}` but of course that gets comma (and other things) as well. – Sobrique Sep 14 '16 at 15:23
@Sobrique I just tried with different scripts, but that doesn't seem to work properly. At least on my command line, when I used the Armenian Google translation of the text, it broke. – simbabque Sep 14 '16 at 15:35
What would the converse of this be i.e. capitalized words that don't start a sentence? – basil Sep 14 '16 at 16:54
@basil `perl -nE 'say for /\B{sb}([A-Z]\w+)\b/g'` works. With your test, that's only _Information_. – simbabque Sep 15 '16 at 07:47

Sobrique · Answer 2 · 2016-09-14T15:34:02.557

1

#!/usr/bin/env perl

use strict;
use warnings;
use Data::Dumper;

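local $/;    # slurp mode: read all of DATA in one go

# First word at the start of the text, or after ., ! or ? followed by whitespace
my @words = <DATA> =~ m/(?:^\s*|[.!?]+\s+)(\w+)/g;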

print Dumper \@words;

__DATA__
 This is a test. This is only a test! If there were an emergency, then Information would be provided for you.

So as a command line:

perl -0777 -nE 'say for m/(?:^\s*|[.!?]+\s+)(\w+)/g' somefile

edited Sep 14 '16 at 15:34

answered Sep 14 '16 at 15:16

Sobrique

anubhava · Answer 3 · 2016-09-14T15:43:52.423

1

You can use this GNU grep command to extract the first word after each period, ! or ?:

grep -oP '(?:^|[.?!])\s*\K[A-Z][a-z]+' file

This
This
If

Though I must caution you may get false results for cases like Mr. Smith.

Regex Breakup:

(?:^|[.?!]) - match start or DOT or ! or ?
\s* - match 0 or more whitespaces
\K - reset the match start, discarding what was matched so far
[A-Z][a-z]+ - match a word starting with an uppercase letter
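
The Mr. Smith caveat can be reproduced on a small sample, and one possible mitigation is sketched below. The sample sentence and the short list of titles (Mr, Mrs, Dr) are assumptions for illustration; a real document would need its own list of abbreviations.

```shell
sample='We met Mr. Smith today. He was late.'

# Original pattern: "Smith" is reported because it follows "Mr."
printf '%s\n' "$sample" | grep -oP '(?:^|[.?!])\s*\K[A-Z][a-z]+'

# Sketch: negative lookbehinds skip a period that directly follows a listed title.
printf '%s\n' "$sample" |
  grep -oP '(?:^|(?<!\bMr)(?<!\bMrs)(?<!\bDr)[.?!])\s*\K[A-Z][a-z]+'
```

With a GNU grep that has PCRE support (the -P option), the first command reports We, Smith and He, while the second reports only We and He.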

edited Sep 14 '16 at 15:43

answered Sep 14 '16 at 15:23

anubhava

This is the simplest solution, but simbabque's solution works as well. – basil Sep 14 '16 at 15:26
@basil And by _simple_ you mean _short_? :P – simbabque Sep 14 '16 at 15:27
Yes. I'm not a perl champion yet and I'm still getting used to just doing pattern matching from the command line. I don't mean to criticize or complain, I apologize. Also, I tested this on a much larger document and it was, for some reason, catching on to single-characters that were not beginning of sentences. I changed the last * to a + to account for this. It works in my situation since I am working with a formal document that does not use first person pronouns but would be something to look into. – basil Sep 14 '16 at 15:43
Yes indeed to avoid single character match use `[A-Z][a-z]+` – anubhava Sep 14 '16 at 15:44
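
The difference between the `*` and `+` variants discussed in these comments can be seen on a short sample. This is only an illustrative sketch; the sample sentence is an assumption chosen to contain initials and the pronoun "I".

```shell
sample='J. R. Tolkien wrote it. I read it.'

# [a-z]* accepts a lone capital, so the initials and "I" are all reported.
printf '%s\n' "$sample" | grep -oP '(?:^|[.?!])\s*\K[A-Z][a-z]*'

# [a-z]+ requires at least one lowercase letter after the capital.
printf '%s\n' "$sample" | grep -oP '(?:^|[.?!])\s*\K[A-Z][a-z]+'
```

The first command reports J, R, Tolkien and I; the second reports only Tolkien. The stricter pattern drops the initials, but it also drops the legitimate sentence starter "I", and "Tolkien" is still reported even though it does not begin a sentence.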
