
I have been searching for a solution to this problem. I am writing a custom function to count the number of sentences in a text. I tried nltk and textstat, but they give me different counts.

An example input looks like this:

Annie said, "Are you sure? How is it possible? you are joking, right?"

NLTK gives a count of 3:

['Annie said, "Are you sure?', 'How is it possible?', 'you are joking, right?"']

Another example:

Annie said, "It will work like this! you need to go and confront your friend. Okay!"

NLTK again gives a count of 3.

Please suggest an approach. The expected count is 1 in each case, since each example is a single sentence containing direct speech.

DataEater

1 Answer


I have written a simple function that does what you want:

def sentences_counter(text: str):

    end_of_sentence = ".?!…"
    # complete with whatever end of a sentence punctuation mark I might have forgotten
    # you might for instance want to add '\n'.

    sentences_count = 0
    sentences = []
    inside_a_quote = False
    
    start_of_sentence = 0
    last_end_of_sentence = -2
    for i, char in enumerate(text):
        
        # quote management, to solve your issue
        if char == '"':
            inside_a_quote = not inside_a_quote
            # American-style quoting: a closing quote right after an end mark ends the sentence
            if not inside_a_quote and text[i-1] in end_of_sentence:
                last_end_of_sentence = i
        elif inside_a_quote:
            continue

        # basic management of sentences with the punctuation marks in `end_of_sentence`
        if char in end_of_sentence:
            last_end_of_sentence = i
        elif last_end_of_sentence == i-1:
            sentences.append(text[start_of_sentence:i].strip())
            sentences_count += 1
            start_of_sentence = i
    
    # flush the remaining text, in case it has no end punctuation mark;
    # strip first so that trailing whitespace is not counted as an empty sentence
    last_sentence = text[start_of_sentence:].strip()
    if last_sentence:
        sentences.append(last_sentence)
        sentences_count += 1
    
    return sentences_count, sentences

Consider the following:

text = '''Annie said, "Are you sure? How is it possible? you are joking, right?" No, I'm not... I thought you were'''

To generalize your problem a bit, I added two more sentences: one ending with an ellipsis and a last one with no end punctuation mark at all. Now, if I execute this:

sentences_count, sentences = sentences_counter(text)
print(f'{sentences_count} sentences detected.')
print(f'The detected sentences are: {sentences}')

I obtain this:

3 sentences detected.
The detected sentences are: ['Annie said, "Are you sure? How is it possible? you are joking, right?"', "No, I'm not...", 'I thought you were']

I think it works fine.

Note: the quote handling in my solution assumes American-style quotation, where the end punctuation mark of the sentence can sit inside the closing quote. To disable this, remove the `if not inside_a_quote and text[i-1] in end_of_sentence:` check and the assignment beneath it.
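
If you would rather switch the quote rule on and off than delete lines, the same scan can take a flag. This is only a sketch of that idea: `count_sentences`, `END_MARKS` and the `american_quotes` parameter are names I made up for the illustration, and it still assumes straight double quotes (`"`) that are properly paired.

```python
END_MARKS = ".?!…"


def count_sentences(text, american_quotes=True):
    """Hypothetical variant of the scan above, with the quote rule behind a flag."""
    sentences = []
    inside_quote = False
    start = 0
    last_end = -2  # index of the most recent sentence-ending character

    for i, char in enumerate(text):
        if char == '"':
            inside_quote = not inside_quote
            # A closing quote right after an end mark ends the sentence,
            # but only when American-style quoting is enabled.
            if american_quotes and not inside_quote and text[i - 1] in END_MARKS:
                last_end = i
        elif inside_quote:
            # Punctuation inside a quote never ends the outer sentence.
            continue

        if char in END_MARKS:
            last_end = i
        elif last_end == i - 1:
            # First character after a sentence end: close the sentence.
            sentences.append(text[start:i].strip())
            start = i

    # Flush whatever is left, ignoring pure whitespace.
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return len(sentences), sentences


example = 'She said, "Stop." Then she left.'
print(count_sentences(example)[0])                         # -> 2
print(count_sentences(example, american_quotes=False)[0])  # -> 1

question = 'Annie said, "Are you sure? How is it possible? you are joking, right?"'
print(count_sentences(question)[0])                        # -> 1
```

With the flag off, a closing quote never ends a sentence by itself, so `"Stop." Then she left.` stays attached to what precedes it. Both examples from the question give a count of 1 either way, because the quote closes at the very end of the text.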

frogger