Is there a better way to do this task?
While pre-processing text for an NLP task, I was trying to split large pieces of text into a list of strings of roughly equal length.
Splitting the text at every "." gives sentences of very uneven length, while cutting at a fixed character index cuts sentences off in the middle.
The goal is a list of chunks of roughly equal length, where no chunk ends before its last sentence does.
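To make the problem concrete, here is a minimal sketch of the two naive approaches. The `sample` string and the `width` of 30 are made up for illustration only and are not part of my real data.

```python
# Hypothetical sample text, only for illustration.
sample = "Short one. This sentence is quite a bit longer than the first one. Tiny."

# Approach 1: split at every period.
# Sentences stay whole, but their lengths are very uneven.
by_dot = [s.strip() + "." for s in sample.split(".") if s.strip()]
dot_lengths = [len(s) for s in by_dot]

# Approach 2: cut every `width` characters (arbitrary value for this example).
# Lengths are even, but sentences are cut in the middle.
width = 30
by_index = [sample[i:i + width] for i in range(0, len(sample), width)]

print(dot_lengths)
print(by_index)
```

The first list has whole sentences of very different lengths, and the second has even chunks that break off mid-word.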
This is the solution I came up with, but I feel like something simpler should exist.
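def even_split(text):
    combined_sentences = []
    tmp_text = text.strip()
    # Make sure the text ends with a period so the last chunk is always closed.
    # The emptiness check avoids an IndexError on empty input.
    if tmp_text and not tmp_text.endswith("."):
        tmp_text += "."
    while tmp_text:
        # Positions of every period in the remaining text.
        dots = [i for i, ch in enumerate(tmp_text) if ch == "."]
        # Cut at the period whose position is closest to 150 characters.
        split_dot = min(dots, key=lambda i: abs(i - 150))
        combined_sentences.append(tmp_text[:split_dot + 1])
        tmp_text = tmp_text[split_dot + 1:].strip()
    return combined_sentences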
For example, if I input the following string:
Marketing products and services is a demanding and tedious task in today’s overly saturated market. Especially if you’re in a B2B lead generation business. As a business owner or part of the sales team, you really need to dive deep into understanding what strategies work best and how to appeal to your customers most efficiently. Lead generation is something you need to master. Understanding different types of leads will help you sell your product or services and scale your business faster. That’s why we’re explaining what warm leads are and how you can easily turn them into paying customers.
This will output:
['Marketing products and services is a demanding and tedious task in today’s overly saturated market. Especially if you’re in a B2B lead generation business.',
'As a business owner or part of the sales team, you really need to dive deep into understanding what strategies work best and how to appeal to your customers most efficiently.',
'Lead generation is something you need to master. Understanding different types of leads will help you sell your product or services and scale your business faster.',
'That’s why we’re explaining what warm leads are and how you can easily turn them into paying customers.']
As you can see, the chunks are split fairly evenly at around 150 characters each. I hope this is clear.
Is there a better way to do this task?
Thanks!