
I want to create a function that takes a string (a text) as input and capitalizes every letter that comes after a punctuation mark. The thing is, strings don't work like lists, so I don't really know how to do it. I tried this, but it doesn't seem to be working:

def capitalize(strin):
    listrin = list(strin)
    listrin[0] = listrin[0].upper()
    ponctuation = ['.', '!', '?']
    for x in range(len(listrin)):
        if listrin[x] in ponctuation:
            # the last valid index is len(listrin) - 1, so the guard
            # x != len(listrin) never triggers and x + 1 can go out of range
            if x < len(listrin) - 1:
                if listrin[x + 1] != " ":
                    listrin[x + 1] = listrin[x + 1].upper()
                # the letter after the space is at x + 2, not x + 1
                elif x < len(listrin) - 2 and listrin[x + 2] != " ":
                    listrin[x + 2] = listrin[x + 2].upper()
    return ''.join(listrin)

For now, I am trying to solve it with this string: 'hello! how are you? please remember capitalization. EVERY time.'
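To illustrate the behavior I am after, here is a small sketch that walks the string as a list and skips any spaces between the punctuation and the next letter (the helper name is made up):

```python
def capitalize_after_punct(text):
    chars = list(text)
    chars[0] = chars[0].upper()
    for i, c in enumerate(chars):
        if c in '.!?':
            j = i + 1
            # skip any spaces between the punctuation and the next letter
            while j < len(chars) and chars[j] == ' ':
                j += 1
            if j < len(chars):
                chars[j] = chars[j].upper()
    return ''.join(chars)

print(capitalize_after_punct('hello! how are you? please remember capitalization. EVERY time.'))
# Hello! How are you? Please remember capitalization. EVERY time.
```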

Gino Mempin

2 Answers


I use a regexp to do this:

>>> import re
>>> line = 'hi. hello!   how are you?  fine!  me too, haha. haha.'
>>> re.sub(r"(?:^|(?:[.!?]\s+))(.)", lambda m: m.group(0).upper(), line)
'Hi. Hello!   How are you?  Fine!  Me too, haha. Haha.'
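The same substitution also handles the string from the question; the pattern uppercases the first character of the string and the first character after each sentence-ending mark:

```python
import re

line = 'hello! how are you? please remember capitalization. EVERY time.'
# match either the start of the string, or '.', '!' or '?' followed by
# whitespace, then capture the next character and uppercase the whole match
result = re.sub(r"(?:^|(?:[.!?]\s+))(.)", lambda m: m.group(0).upper(), line)
print(result)  # Hello! How are you? Please remember capitalization. EVERY time.
```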
Kinght 金

The most basic approach is to split the text into sentences based on the punctuation, which gives you a list. Then loop over the items of the list, strip() them, and capitalize() them. Something like the following might solve your problem:

import re

input_sen = 'hello! how are you? please remember capitalization. EVERY time.'
# pass your punctuation list as a character class, e.g. r'[.!?]'
sentences = re.split(r'[.!?]', input_sen)
for i in sentences:
    print(i.strip().capitalize(), end='')

However, it is better to use the nltk library:

from nltk.tokenize import sent_tokenize  # run nltk.download('punkt') once first
input_sen = 'hello! how are you? please remember capitalization. EVERY time.'
sentences = sent_tokenize(input_sen)
sentences = [sent.capitalize() for sent in sentences]
print(sentences)

It is better to use the NLTK library (or some other NLP library) than to manually write rules and regexes, because it takes care of many cases we don't account for. It solves the problem of sentence boundary disambiguation.

Sentence boundary disambiguation (SBD), also known as sentence breaking, is the problem in natural language processing of deciding where sentences begin and end. Often natural language processing tools require their input to be divided into sentences for a number of reasons. However sentence boundary identification is challenging because punctuation marks are often ambiguous. For example, a period may denote an abbreviation, decimal point, an ellipsis, or an email address – not the end of a sentence. About 47% of the periods in the Wall Street Journal corpus denote abbreviations. As well, question marks and exclamation marks may appear in embedded quotations, emoticons, computer code, and slang. Languages like Japanese and Chinese have unambiguous sentence-ending markers.

Hope it helps.

sid8491
  • Thank you very much! What is tokenize? – Sammy Steffensen Nov 29 '17 at 12:17
  • I have updated the code and used a different method to break the input into sentences. Tokenizers are used to divide strings into lists of substrings, e.g. sent_tokenize is used to find the list of sentences. @SammySteffensen – sid8491 Nov 30 '17 at 06:14