Convert a paragraph into sentences using Perl

Question

I'm writing a Perl program. I need to read a paragraph and print each sentence on its own line.

Does anyone know how to do it?

Below is my code:

#! /C:/Perl64/bin/perl.exe

use utf8;

if (! open(INPUT, '< text1.txt')){
die "cannot open input file: $!";
}

if (! open(OUTPUT, '> output.txt')){
die "cannot open output file: $!";
}

select OUTPUT;

while (<INPUT>){
print "$_";
}

close INPUT;
close OUTPUT;
select STDOUT;

What do you mean by 'convert a paragraph into a sentence'? Do you mean 'split a paragraph into separate sentences' and then print out each sentence on a line of its own? How many sentences are there in "It seems that Mr. A. P. McDowney has been rather busy!"? — Jonathan Leffler, Apr 01 '13 at 03:00
@squiguy: the first `select` sets the default output stream to the named file (`OUTPUT`) so the unqualified `print` output goes to that file; the second resets the default, but since the script is about to exit, it is superfluous. — Jonathan Leffler, Apr 01 '13 at 03:01
@JonathanLeffler Okay, so it's a little shortcut for writing to a file. Thanks. — squiguy, Apr 01 '13 at 03:02

score 6 · Answer 1 · answered Apr 01 '13 at 03:46

Rather than handle file names, I'll let Perl do that.

This is very crude on multiple levels, and the full job is undoubtedly tough.

sentence.pl

#!/usr/bin/env perl
use strict;
use warnings;
use Lingua::EN::Sentence qw(get_sentences);

sub normalize
{
    my($str) = @_;
    $str =~ s/\n/ /gm;
    $str =~ s/\s\s+/ /gm;
    return $str;
}

{
    local $/ = "\n\n";
    while (<>)
    {
        chomp;
        print "Para: [[$_]]\n";
        my @sentences = split m/(?<=[.!?])\s+/m, $_;
        foreach my $sentence (@sentences)
        {
            $sentence = normalize $sentence;
            print "Ad Hoc Sentence: $sentence\n";
        }
        my $sref = get_sentences($_);
        foreach my $sentence (@$sref)
        {
            $sentence = normalize $sentence;
            print "Lingua Sentence: $sentence\n";
        }
    }
}

The split regex looks for one or more whitespace characters preceded by a full stop (period), exclamation mark or question mark. Because \s also matches newlines, it works across line breaks; the /m modifier is not strictly needed, since the pattern contains no ^ or $ anchors. The look-behind (?<=[.!?]) means that the punctuation is kept with the sentence. The normalize function simply flattens newlines into spaces and collapses multiple spaces into single spaces. (Note that this would not properly recognize a parenthetical sentence.) This sentence would be counted as part of the previous one, because the . is followed by a ) rather than a blank.
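To make the parenthetical limitation concrete, here is a minimal sketch using only core Perl. The sub names split_plain and split_with_closers are hypothetical, and the second pattern is just one possible tweak (it accepts a single closing bracket or quote between the terminator and the whitespace); it is not part of the answer's script.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Same whitespace clean-up idea as normalize() above, plus trimming the ends.
sub normalize {
    my ($str) = @_;
    $str =~ s/\s+/ /g;     # newlines and runs of whitespace become one space
    $str =~ s/^ | $//g;    # drop a leading or trailing space
    return $str;
}

# The answer's pattern: break on whitespace that directly follows . ! or ?
sub split_plain {
    my ($para) = @_;
    return map { normalize($_) } split /(?<=[.!?])\s+/, $para;
}

# Hypothetical variant: also break when one closing bracket or quote
# sits between the terminator and the whitespace.
sub split_with_closers {
    my ($para) = @_;
    return map { normalize($_) }
        split /(?:(?<=[.!?])|(?<=[.!?][)\]"']))\s+/, $para;
}

my $text = "He left early. (She stayed behind.) They met\nagain later.";
print "plain:   $_\n" for split_plain($text);
print "closers: $_\n" for split_with_closers($text);
```

With this input, split_plain returns two pieces because the parenthetical sentence stays glued to the one after it, while split_with_closers returns three. Abbreviations such as "Mr." still defeat both patterns.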

Sample input

This is a paragraph with more than one sentence in it.  How many will be
determined later.  Mr. A. P. McDowney has been rather busy.  This
incomplete sentence will still be counted as one

This is the second paragraph.  With three sentences in it, it is a lot
less exciting than the first paragraph, but the middle sentence extends
over multiple lines and   there   is     some         wonky spacing too.
But 'tis time to finish.

Sample output

Para: [[This is a paragraph with more than one sentence in it.  How many will be
determined later.  Mr. A. P. McDowney has been rather busy.  This
incomplete sentence will still be counted as one]]
Ad Hoc Sentence: This is a paragraph with more than one sentence in it.
Ad Hoc Sentence: How many will be determined later.
Ad Hoc Sentence: Mr.
Ad Hoc Sentence: A.
Ad Hoc Sentence: P.
Ad Hoc Sentence: McDowney has been rather busy.
Ad Hoc Sentence: This incomplete sentence will still be counted as one
Lingua Sentence: This is a paragraph with more than one sentence in it.
Lingua Sentence: How many will be determined later.
Lingua Sentence: Mr. A. P. McDowney has been rather busy.
Lingua Sentence: This incomplete sentence will still be counted as one
Para: [[This is the second paragraph.  With three sentences in it, it is a lot
less exciting than the first paragraph, but the middle sentence extends
over multiple lines and   there   is     some         wonky spacing too.
But 'tis time to finish.
]]
Ad Hoc Sentence: This is the second paragraph.
Ad Hoc Sentence: With three sentences in it, it is a lot less exciting than the first paragraph, but the middle sentence extends over multiple lines and there is some wonky spacing too.
Ad Hoc Sentence: But 'tis time to finish.
Lingua Sentence: This is the second paragraph.
Lingua Sentence: With three sentences in it, it is a lot less exciting than the first paragraph, but the middle sentence extends over multiple lines and there is some wonky spacing too.
Lingua Sentence: But 'tis time to finish.

Notice how Lingua::EN::Sentence managed to handle 'Mr. A. P. McDowney' better than the simple-minded regex does.
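If installing the module is not an option, a rough approximation of the abbreviation handling can be sketched in core Perl. This is not how Lingua::EN::Sentence works internally; the %ABBREV list and the sub name split_sentences_abbrev are hypothetical, and the list is deliberately tiny. The idea is to split with the ad hoc regex and then glue a piece back onto the next one when it ends in a known abbreviation or a single capital initial.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Hypothetical, deliberately short abbreviation list; a real one is far longer.
my %ABBREV = map { $_ => 1 } qw(Mr Mrs Ms Dr Prof St);

sub split_sentences_abbrev {
    my ($para) = @_;
    $para =~ s/\s+/ /g;          # flatten newlines and repeated spaces
    $para =~ s/^\s+|\s+$//g;     # trim both ends

    my @sentences;
    my $pending = '';
    for my $piece (split /(?<=[.!?])\s+/, $para) {
        $pending = length $pending ? "$pending $piece" : $piece;

        # If the text so far ends in "Word." where Word is a known
        # abbreviation or a single capital letter, keep accumulating.
        if ($pending =~ /(?:^|\s)([A-Za-z]+)\.$/) {
            my $word = $1;
            next if $ABBREV{$word} || $word =~ /^[A-Z]$/;
        }
        push @sentences, $pending;
        $pending = '';
    }
    push @sentences, $pending if length $pending;
    return @sentences;
}

my $para = "Mr. A. P. McDowney has been rather busy.  This\n"
         . "incomplete sentence will still be counted as one";
print "Sentence: $_\n" for split_sentences_abbrev($para);
```

The heuristic has obvious holes: a sentence that genuinely ends in a single capital letter ("So did I.") is wrongly merged with the next one, and any abbreviation missing from the list is still split.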

score 4 · Answer 2 · answered Apr 01 '13 at 03:19

Identifying sentences is very hard and language-specific. You'll need help. Maybe Lingua::EN::Sentence is the way to go?

answered by ikegami

score -1 · Accepted Answer · edited Apr 01 '13 at 03:40

If you are given the paragraph as a string, you can split() it on characters that mark the end of a sentence.

For example:

my @sentences = split /[.?!]/, $paragraph;
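A small sketch of what that one-liner returns may help when reading the comments on this answer. The sub names are hypothetical; split_drop_punct wraps the line above unchanged, and match_keep_punct is one possible alternative that matches sentences with a global regex instead of splitting, so the terminator stays attached.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# The one-liner from the answer: split consumes the terminators.
sub split_drop_punct {
    my ($paragraph) = @_;
    return split /[.?!]/, $paragraph;
}

# Hypothetical alternative: match each sentence rather than splitting.
# A match starts at a non-space, non-terminator character and runs up to
# and including the next terminator, if there is one.
sub match_keep_punct {
    my ($paragraph) = @_;
    return $paragraph =~ /([^\s.?!][^.?!]*[.?!]?)/g;
}

my $paragraph = "Is it late? Yes! Go home.";
print "dropped: [$_]\n" for split_drop_punct($paragraph);
print "kept:    [$_]\n" for match_keep_punct($paragraph);
```

The split version loses the punctuation and leaves a leading space on every piece after the first. Neither version knows about abbreviations, so "Dr. No said yes." still comes back as two pieces.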

edited Apr 01 '13 at 03:40 by ikegami
answered Apr 01 '13 at 03:16 by Bitwise

Oh! That's so not right! How can you tell if the mark is at the end of the sentence? It fails to correctly handle this comment, for example. – ikegami Apr 01 '13 at 03:21
@ikegami the OP has not defined what he means by sentence. For many purposes, a sentence can be considered a sequence of words ending with a specified punctuation. I believe this is what the OP was aiming at, given his current code. – Bitwise Apr 01 '13 at 03:27
In what world do you think "Dr. No said yes." can be considered two sentences? If you're going to make up your own definitions, you should state them. – ikegami Apr 01 '13 at 03:38
His code doesn't show any attempts to identify sentences whatsoever, so how can you claim with a straight face that you know what he wants based on code he didn't even conceive yet! I didn't -1, but that claim made me want to. – ikegami Apr 01 '13 at 03:39
Oh, another problem: You actually remove the punctuation marks. – ikegami Apr 01 '13 at 03:40
@ikegami you see, it is actually what he wanted. No need to be negative. ;) thanks for the edit. – Bitwise Apr 01 '13 at 03:45
