0

I have a paragraph in a text file that I want to display in Python so that a new line starts after every full stop (period).

For example my paragraph is

"Dr. Harrison bought bargain.co.uk for 2.5 million pounds, i.e. he
paid a lot for it. Did he mind? John Smith, Esq. thinks he didn't.
Nevertheless, this isn't true... Well, with a probability of .9 it
isn't."

But I want it displayed as follows:

"Dr. Harrison bought bargain.co.uk for 2.5 million pounds, i.e. he
paid a lot for it. 
Did he mind? John Smith, Esq. thinks he didn't. 
Nevertheless, this isn't true... 
Well, with a probability of .9 it isn’t."

This is made more difficult by the other periods that appear in the text, such as those in the website address, 'Dr.', 'Esq.', '.9' and, of course, the first two dots of the ellipsis.

I am not sure how to handle these other periods in the text file. Can anyone help? Thank you.

"Your task is to write a program that given the name of a text file is able to write its content with each sentence on a separate line." <-- Task set

Baileyavfc
  • 4
    That is not a question about Python (or any other programming language) but about what a sentence is in English. –  Mar 10 '14 at 15:21
  • 1
    Without proper sentence analysis this is almost impossible. You can try a dictionary of known good cases, but everything beyond that might be too hard. – RedX Mar 10 '14 at 15:22
  • 1
    No I am asking how to display a text file in python so that it creates a new line after every period. – Baileyavfc Mar 10 '14 at 15:22
  • how do you deal with period in `i.e.`? – zhangxaochen Mar 10 '14 at 15:23
  • 1
    Well, you are asking how to distinguish between the period that is a full stop and the other periods. This is not a Python question at all; it belongs to the Natural Language Processing domain. – Ashalynd Mar 10 '14 at 15:23
  • Well, it is a Python task that I have been asked to accomplish, so it is a Python-related question. Maybe I can split the sentences without relying on the periods themselves, just on their order. – Baileyavfc Mar 10 '14 at 15:25

2 Answers

5

This does the job on your text:

text = "Dr. Harrison bought bargain.co.uk for 2.5 million pounds, i.e. he "\
       "paid a lot for it. Did he mind? John Smith, Esq. thinks he didn't. "\
       "Nevertheless, this isn't true... Well, with a probability of .9 it "\
       "isn't."

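import re

# Break after a period followed by spaces and a capital letter,
# unless that period ends "Dr" or "Esq".
pat = r'(?<!Dr)(?<!Esq)\. +(?=[A-Z])'
print(re.sub(pat, '.\n', text))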

result

Dr. Harrison bought bargain.co.uk for 2.5 million pounds, i.e. he paid a lot for it.
Did he mind? John Smith, Esq. thinks he didn't.
Nevertheless, this isn't true...
Well, with a probability of .9 it isn't.

But no regex pattern can be guaranteed never to fail on something as complex as human writing.
Note, for example, that I had to add a negative lookbehind assertion to exclude the case of Dr. (and I did the same for Esq., though it is not a problem in your text because it is followed by thinks, which doesn't begin with a capital letter).
I think it's impossible to put all similar cases into the regex pattern in advance; there will always be unforeseen cases that turn up one day or another.

Still, this code does a lot of the desired job. Not so bad, I think.
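
Since the task asks for a program that takes a file name, here is a minimal sketch of how the same pattern might be applied to a file's content. The names split_sentences and print_file_sentences, the path argument, and the UTF-8 encoding are assumptions made for illustration; line breaks in the file are collapsed first so that a sentence wrapped across two lines is still detected.

```python
import re

# Same pattern as in the answer: a period, then spaces, then a capital
# letter, skipping periods that end "Dr" or "Esq".
SENTENCE_END = re.compile(r'(?<!Dr)(?<!Esq)\. +(?=[A-Z])')


def split_sentences(text):
    # Collapse line breaks and runs of whitespace into single spaces so
    # that hard-wrapped lines do not hide a sentence boundary.
    flat = ' '.join(text.split())
    return SENTENCE_END.sub('.\n', flat)


def print_file_sentences(path):
    # `path` is a hypothetical file name supplied by the caller;
    # UTF-8 is an assumed encoding.
    with open(path, encoding='utf-8') as handle:
        print(split_sentences(handle.read()))
```

Like the original pattern, this only breaks on periods, so a question mark followed by a new sentence stays on the same line.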

eyquem
1

You could add a line break if and only if the dot is followed by a space AND a capital letter. It won't solve all of the cases, but combined with the use of a dictionary of exceptions like "Dr.", you could do a pretty good job, although not perfect.

Update: By a dictionary I mean both a Python dictionary and a word list like this one. I did not find any downloadable file containing the most common abbreviations, so I'm afraid you'll have to make one yourself.
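
A rough sketch of that idea, under stated assumptions: the ABBREVIATIONS set below is a small hand-made example rather than a real downloadable list, and split_sentences is a hypothetical helper name. A period followed by spaces and a capital letter is treated as a candidate boundary, and the candidate is rejected when the word ending in that period is a known abbreviation.

```python
import re

# Hypothetical, hand-made exception list; extend it for your own texts.
ABBREVIATIONS = {"Dr.", "Mr.", "Mrs.", "Esq.", "i.e.", "e.g."}

# Candidate boundary: a period, one or more spaces, then a capital letter.
CANDIDATE = re.compile(r'\. +(?=[A-Z])')


def split_sentences(text):
    pieces = []
    start = 0
    for match in CANDIDATE.finditer(text):
        # Text of the current sentence so far, including this period.
        chunk = text[start:match.start() + 1]
        words = chunk.split()
        last_word = words[-1] if words else ""
        if last_word in ABBREVIATIONS:
            continue  # known abbreviation, not a sentence end
        pieces.append(chunk)
        start = match.end()  # next sentence starts at the capital letter
    pieces.append(text[start:])
    return "\n".join(pieces)
```

Keeping the exceptions in a set rather than in the regex means new abbreviations can be added without touching the pattern, though it will still miss cases nobody thought to list.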

Pascal Le Merrer
  • I'm happy to read that my answer does a pretty good job :) – eyquem Mar 10 '14 at 15:51
  • @eyquem When I started to write my answer, yours wasn't posted yet. And it took me a while because I was looking for a downloadable word list. There is no reason to be ironic. – Pascal Le Merrer Mar 11 '14 at 12:21