I am using the code below to summarize texts that are longer than the model's 1024-token limit.
The current method splits the text in half; I took it from another user's post and modified it slightly.
What I want to do instead is split the whole text into chunks of 1024 tokens each, summarize each chunk, and at the end concatenate the summaries in the correct order and write them to a file. How can I do this tokenization and get the output in the correct order?
Splitting the text with split(" ") doesn't work the same as tokenization; it produces a different count. (A rough sketch of what I have in mind is at the end of this post.) My current code:
import logging
from transformers import pipeline

# Read the article to be summarized.
with open("TextFile1.txt", "r") as f:
    ARTICLE = f.read()

summarizer = pipeline("summarization", model="facebook/bart-large-cnn")
counter = 1

def summarize_text(text: str, max_len: int) -> str:
    global counter
    try:
        #logging.warning("max_len " + str(max_len))
        summary = summarizer(text, min_length=30, do_sample=False)
        # Save the chunk that was actually summarized, for inspection.
        with open('parsed_' + str(counter) + '.txt', 'w') as f:
            f.write(text)
        counter += 1
        return summary[0]["summary_text"]
    except IndexError:
        # Input is too long for the model: split it in half and summarize each half.
        logging.warning("Sequence length too large for model, cutting text in half and calling again")
        half = len(text) // 2
        return summarize_text(text=text[:half], max_len=max_len) + " " + summarize_text(text=text[half:], max_len=max_len)

gg = summarize_text(ARTICLE, 1024)

with open('summarized.txt', 'w') as f:
    f.write(gg)
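
For reference, below is a rough sketch of the kind of chunking I have in mind (the function name chunk_and_summarize and the chunk size of 900, picked to leave headroom under the 1024-token limit for special tokens, are just placeholders I made up). It tokenizes the article once with the model's own tokenizer, slices the token ids into fixed-size chunks, and decodes each chunk back to text before passing it to the summarization pipeline. I'm not sure this is the right way to do it:

from transformers import AutoTokenizer, pipeline

tokenizer = AutoTokenizer.from_pretrained("facebook/bart-large-cnn")
summarizer = pipeline("summarization", model="facebook/bart-large-cnn")

def chunk_and_summarize(text: str, chunk_size: int = 900) -> str:
    # Tokenize the whole article once, without special tokens; the pipeline
    # adds them again when it re-tokenizes each decoded chunk.
    token_ids = tokenizer.encode(text, add_special_tokens=False)

    summaries = []
    for start in range(0, len(token_ids), chunk_size):
        chunk_ids = token_ids[start:start + chunk_size]
        # Turn this slice of token ids back into plain text for the pipeline.
        chunk_text = tokenizer.decode(chunk_ids, skip_special_tokens=True)
        result = summarizer(chunk_text, min_length=30, do_sample=False)
        summaries.append(result[0]["summary_text"])

    # Chunks are processed front to back, so joining keeps the original order.
    return " ".join(summaries)

Is splitting the token ids like this the correct way to get fixed-size token chunks, or is there a better way to map the chunks back to text?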