
This is the code that I used for sent_tokenize:

import nltk
from nltk.tokenize import sent_tokenize

# nltk.download('punkt')  # run once if the Punkt model is missing
sent_tokenize(comments1)

[screenshot: Dataset]

Then I used an array to get the sentences one by one, but it didn't work:

Arr=sent_tokenize(comments1)
Arr
Arr[0]

And when I use Arr[1], this error comes up:

IndexError                                Traceback (most recent call last)
<ipython-input-27-c15dd30f2746> in <module>
----> 1 Arr[1]

IndexError: list index out of range
N.Perera

3 Answers


NLTK's sent_tokenize works on well-formatted text. I think you're looking for a regular expression:

import re

comments_str = "1,Opposition MP Namal Rajapaksa questions Environment Minister President Sirisena over Wilpattu deforestation issue\nbut he should remember that it all started with his dad and uncle and might be he was in a coma in that days \n3, Opposition on MP Namal Rajapaksa questions Environment Minister President Sirisena over Wilpattu deforestation issue\n4,Pawu meya ba meyage thathata oka deddi kiyana thibbane"
# Split on a comment number ("1,", "3,", ...) at the start of the string or after a newline.
comments = re.split(r'(?:^\d+,)|(?:\n\d+,)', comments_str)
print(comments)

Outputs:

[
    '',
    'Opposition MP Namal Rajapaksa questions Environment Minister President Sirisena over Wilpattu deforestation issue\nbut he should remember that it all started with his dad and uncle and might be he was in a coma in that days ',
    ' Opposition on MP Namal Rajapaksa questions Environment Minister President Sirisena over Wilpattu deforestation issue',
    'Pawu meya ba meyage thathata oka deddi kiyana thibbane'
]
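
Note that the first element is an empty string, an artifact of the split pattern matching at the very start of the string. A minimal sketch of cleaning that up, reusing the comments list from above:

# Drop empty entries and trim surrounding whitespace from each comment.
sentences = [c.strip() for c in comments if c.strip()]
print(sentences[0])  # first comment, without leading/trailing spaces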
Al Johri

The default NLTK tokenizer doesn't recognise the sentences here because the final punctuation is missing. You can add it yourself before each newline "\n".

For instance:

# Add a period before each newline so the tokenizer sees a sentence boundary.
comments1 = comments1.replace("\n", ".\n")
tokens = sent_tokenize(comments1)
for token in tokens:
    print("sentence: " + token)

You get something like this (truncated for readability):

sentence: 1, Opposition MP Namal Rajapaksa questions Environment Ministe [...] Sirisena over Wilpattu deforestation issue.
sentence: 2, but he should remember that it all started with his dad and [...] a coma in that days .
sentence: 3, Opposition MP Namal Rajapaksa questions Environment Ministe [...] Sirisena over Wilpattu deforestation issue.
sentence: 4, Pawu meya ba meyage  [...] 
sentence: 5, We visited Wilpaththu in August 2013 These are some of the  [...] deforestation of Wilpattu as Srilankans .
sentence: 6, Mamath wiruddai .
sentence: 7, Yeah we should get together and do something.
sentence: 8, Mama Kamathyi oka kawada hari wenna tiyana deyak .
sentence: 9, Yes Channa Karunaratne we should stand agaist this act dnt  [...] as per news spreading Pls .
sentence: 10, LTTE eken elawala daapu oya minissunta awurudu 30kin passe [...] sathungena balaneka thama manussa kama .
sentence: 11, Shanthar mahaththayo ow aththa eminisunta idam denna one g [...] vikalpa yojana gena evi kiyala.
sentence: 12, You are wrong They must be given lands No one opposes it W [...]
Laurent LAPORTE
  • This is working, thank you Laurent, but how can I use an array to get the sentences one by one? Can you tell me a possible way using an array? – N.Perera Oct 09 '19 at 06:03
  • In my code sample, *tokens* is a Python [list](https://docs.python.org/3/tutorial/datastructures.html) (an array). So `tokens[0]` is the first sentence, `tokens[1]` the second, and so on (see the sketch below). – Laurent LAPORTE Oct 09 '19 at 09:53
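
A minimal sketch of the indexing described in the comments above; the comments1 string here is a hypothetical stand-in for the asker's dataset, and the Punkt model is assumed to be downloaded:

from nltk.tokenize import sent_tokenize

comments1 = "First comment\nSecond comment\nThird comment"  # hypothetical stand-in data
tokens = sent_tokenize(comments1.replace("\n", ".\n"))       # same fix as in the answer
print(tokens[0])    # 'First comment.'
print(tokens[1])    # 'Second comment.'
print(len(tokens))  # 3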

Read the docstrings in the following NLTK source code.

# Standard sentence tokenizer.
def sent_tokenize(text, language='english'):
    """
    Return a sentence-tokenized copy of *text*,
    using NLTK's recommended sentence tokenizer
    (currently :class:`.PunktSentenceTokenizer`
    for the specified language).

    :param text: text to split into sentences
    :param language: the model name in the Punkt corpus
    """
    tokenizer = load('tokenizers/punkt/{0}.pickle'.format(language))
    return tokenizer.tokenize(text)


def tokenize(self, text, realign_boundaries=True):
    """
    Given a text, returns a list of the sentences in that text.
    """
    return list(self.sentences_from_text(text, realign_boundaries))
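
For illustration, a minimal sketch that calls the Punkt tokenizer directly, mirroring what the two functions above do (this assumes the English Punkt model has been downloaded via nltk.download('punkt')):

from nltk.data import load

# Load the English Punkt model, exactly as sent_tokenize does internally.
tokenizer = load('tokenizers/punkt/english.pickle')
print(tokenizer.tokenize("It works. Does it split? Yes!"))
# ['It works.', 'Does it split?', 'Yes!']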

Since language='english' treats !, ?, and . as the end of a sentence, it works to add comments1 = comments1.replace('\n', '. ') before calling sent_tokenize(comments1).

Your question is possibly a duplicate of "nltk sentence tokenizer, consider new lines as sentence boundary".

caot