
I have written some Python code to generate an extractive summary of a txt file. I am getting an IndexError: list index out of range error for this line of my code:

    for i in range(top_n):
        summarize_text.append(" ".join(ranked_sentence[i][1]))

I was following a tutorial to implement the process: https://towardsdatascience.com/understand-text-summarization-and-create-your-own-summarizer-in-python-b26a9f09fc70 I didn't find much help in its reviews or comments. I tried searching for similar problems here, to no avail.

My full code:

from nltk.corpus import stopwords
from nltk.cluster.util import cosine_distance
import numpy as np
import networkx as nx

def read_article(file_name):
    file = open(file_name, "r+",encoding="utf-8")
    filedata = file.readlines()
    article = filedata[0].split(". ")
    sentences = []

    for sentence in article:
        print(sentence)
        sentences.append(sentence.replace("[^a-zA-Z]", " ").split(" "))
    sentences.pop() 

    return sentences

def sentence_similarity(sent1, sent2, stopwords=None):
    if stopwords is None:
        stopwords = []

    sent1 = [w.lower() for w in sent1]
    sent2 = [w.lower() for w in sent2]

    all_words = list(set(sent1 + sent2))

    vector1 = [0] * len(all_words)
    vector2 = [0] * len(all_words)

    # build the vector for the first sentence
    for w in sent1:
        if w in stopwords:
            continue
        vector1[all_words.index(w)] += 1

    # build the vector for the second sentence
    for w in sent2:
        if w in stopwords:
            continue
        vector2[all_words.index(w)] += 1

    return 1 - cosine_distance(vector1, vector2)

def build_similarity_matrix(sentences, stop_words):
    # Create an empty similarity matrix
    similarity_matrix = np.zeros((len(sentences), len(sentences)))

    for idx1 in range(len(sentences)):
        for idx2 in range(len(sentences)):
            if idx1 == idx2: #ignore if both are same sentences
                continue 
            similarity_matrix[idx1][idx2] = sentence_similarity(sentences[idx1], sentences[idx2], stop_words)

    return similarity_matrix


def generate_summary(file_name, top_n=5):
    stop_words = stopwords.words('english')
    summarize_text = []

    # Step 1 - Read text and split it
    sentences =  read_article(file_name)

    # Step 2 - Generate Similarity Matrix across sentences
    sentence_similarity_martix = build_similarity_matrix(sentences, stop_words)

    # Step 3 - Rank sentences in similarity matrix
    sentence_similarity_graph = nx.from_numpy_array(sentence_similarity_martix)
    scores = nx.pagerank(sentence_similarity_graph)

    # Step 4 - Sort the rank and pick top sentences
    ranked_sentence = sorted(((scores[i],s) for i,s in enumerate(sentences)), reverse=True)    
    print("Indexes of top ranked_sentence order are ", ranked_sentence)    

    # **THE ERROR**
    for i in range(top_n):
        summarize_text.append(" ".join(ranked_sentence[i][1]))

    # Step 5 - Of course, output the summarized text
    print("Summarize Text: \n", ". ".join(summarize_text))

# let's begin
generate_summary( "F:\\Girivraaj\\tmp\\document8.txt", 2)

The error is shown for:

    for i in range(top_n):
        summarize_text.append(" ".join(ranked_sentence[i][1]))

(marked **THE ERROR** in the full code)

The expected result would be a summary.
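For context, the usual trigger for this IndexError is that fewer sentences survive preprocessing than `top_n`, so `ranked_sentence[i]` runs off the end of the list. A minimal, self-contained sketch of a bounds-guarded version of the failing loop (the one-entry `ranked_sentence` here is invented for illustration):

```python
# Hypothetical data: pretend ranking kept only one sentence, while top_n = 5.
ranked_sentence = [(0.9, ["Only", "one", "sentence"])]
top_n = 5

summarize_text = []
# Clamping the loop bound to the list length avoids indexing past the end.
for i in range(min(top_n, len(ranked_sentence))):
    summarize_text.append(" ".join(ranked_sentence[i][1]))

print(summarize_text)
```

With the unguarded `range(top_n)`, the same data raises the IndexError on `i == 1`.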

Surja Ray

2 Answers


I noticed that this code works with some texts but not others. I got the same error, but once I removed the whitespace between the paragraphs it ran with no problem. I think it might be sensitive to certain special characters.
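A minimal sketch of that idea, stripping blank lines before sentence splitting (the sample lines are invented, standing in for `file.readlines()` output):

```python
# Invented sample: readlines() output with a blank line between paragraphs.
raw_lines = ["First sentence. Second sentence.\n", "\n", "Third sentence.\n"]

lines = [line.strip() for line in raw_lines if line.strip()]  # drop blank lines
text = " ".join(lines)
sentences = text.split(". ")
print(sentences)
```

Without the filtering step, the blank line would produce an empty "sentence" that can throw off later indexing.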


There is a problem with your script when it deals with regex and newlines. Also, I didn't get the exact use of sentences.pop()?

Change the read_article function to the code below:

def read_article(file_name):
    sentences = []
    file = open(file_name, 'r')
    f_data = file.readlines()
    f_data = [x for x in f_data if x != '\n']  # remove any blank lines
    f_data = [x.replace('\n', ' ') for x in f_data]  # remove end-of-line breaks
    f_data = ''.join(f_data)
    article = f_data.split('. ')
    for sentence in article:
        sentences.append(sentence.replace("^[a-zA-Z0-9!@#$&()-`+,/\"]", " ").split(" "))
    return sentences
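One reason this helps: the original read_article used `filedata[0]`, so any text after the first newline was silently dropped, leaving fewer sentences than `top_n`. A small sketch (the two-line string is invented) comparing the two approaches:

```python
# Two "file lines" standing in for readlines() output.
raw = "One. Two.\nThree. Four.\n"
filedata = raw.splitlines(keepends=True)

# Original approach: only the first line is ever split into sentences.
first_line_only = filedata[0].split(". ")

# This answer's approach: join every non-blank line before splitting.
joined_text = ''.join(x.replace('\n', ' ') for x in filedata if x != '\n')
joined = joined_text.split(". ")

print(len(first_line_only), len(joined))
```

The joined version sees every sentence in the file, so `ranked_sentence` is long enough for the summary loop.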
flaxel
    Welcome. Thanks for your contribution. Please explain your code changes, and why. Highlight why/how your code solves the OP's issue. Code only responses are discouraged on SO. Most upvotes are gained over time as future visitors learn something from your answer that they can apply to their own coding issues. – SherylHohman Nov 12 '20 at 18:40