Is there a better way to do this task?

For the pre-processing step of an NLP task, I was trying to split large pieces of text into a list of strings of roughly even length.

Splitting the text at every "." gives sentences of very uneven length. Cutting at a fixed index/number instead would cut sentences off in the middle.
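To make the trade-off concrete, here is a small sketch of both naive approaches. The `sample` string and the `width` of 30 are made up for illustration and are not part of the original task.

```python
# Hypothetical sample text, chosen only to show the two failure modes.
sample = "Short one. This second sentence is quite a bit longer than the first. Tiny."

# Approach 1: split at every period. Sentences stay whole, but lengths vary a lot.
by_dot = [s.strip() + "." for s in sample.split(".") if s.strip()]
print([len(s) for s in by_dot])

# Approach 2: cut every `width` characters. Lengths are even, but sentences are cut mid-way.
width = 30
by_index = [sample[i:i + width] for i in range(0, len(sample), width)]
print(by_index)
```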

The goal was a list of chunks of roughly even length, without truncating any sentence before it ends.

This is the solution I came up with, but I feel like something simpler should exist.

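def even_split(text, target=150):
    """Split text into chunks that each end at the period closest to `target` characters."""
    combined_sentences = []
    tmp_text = text.strip()
    # Make sure non-empty text ends with a period so the last sentence is kept.
    if tmp_text and not tmp_text.endswith("."):
        tmp_text += "."
    while tmp_text:
        # Positions of every period in the remaining text.
        dots = [i for i, char in enumerate(tmp_text) if char == "."]
        # Cut at the period whose position is closest to the target length.
        split_dot = min(dots, key=lambda pos: abs(pos - target))
        combined_sentences.append(tmp_text[:split_dot + 1])
        tmp_text = tmp_text[split_dot + 1:].strip()
    return combined_sentences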

For example, if I input the following string:

Marketing products and services is a demanding and tedious task in today’s overly saturated market. Especially if you’re in a B2B lead generation business. As a business owner or part of the sales team, you really need to dive deep into understanding what strategies work best and how to appeal to your customers most efficiently. Lead generation is something you need to master. Understanding different types of leads will help you sell your product or services and scale your business faster. That’s why we’re explaining what warm leads are and how you can easily turn them into paying customers.

This will output:

['Marketing products and services is a demanding and tedious task in today’s overly saturated market. Especially if you’re in a B2B lead generation business.',
 'As a business owner or part of the sales team, you really need to dive deep into understanding what strategies work best and how to appeal to your customers most efficiently.',
 'Lead generation is something you need to master. Understanding different types of leads will help you sell your product or services and scale your business faster.',
 'That’s why we’re explaining what warm leads are and how you can easily turn them into paying customers.']

As you can see, the chunks are split fairly evenly, at around 150 characters each. I hope this is clear.

Is there a better way to do this task?

Thanks!

mozway
alb
  • Why aren't you just using the `split()` method to split the text at every period? Looks like you're reinventing the wheel. – Jan 13 '22 at 14:11
  • It seems like you're a beginner, which is great! Hats off to your well-formatted and thorough question. – Warlax56 Jan 13 '22 at 14:18

2 Answers

IIUC, you want to split the text on periods, but keep a minimum chunk length to avoid very short sentences.

You can split on the dots and join the pieces again until you reach a threshold (here 200 characters):

out = []
threshold = 200
# Drop the final period first so the last chunk does not end in "..".
for chunk in text.strip().rstrip('.').split('. '):
    if out and len(chunk)+len(out[-1]) < threshold:
        # Still under the threshold: extend the previous chunk.
        out[-1] += ' '+chunk+'.'
    else:
        # Otherwise start a new chunk.
        out.append(chunk+'.')

output:

['Marketing products and services is a demanding and tedious task in today’s overly saturated market. Especially if you’re in a B2B lead generation business.',
 'As a business owner or part of the sales team, you really need to dive deep into understanding what strategies work best and how to appeal to your customers most efficiently.',
 'Lead generation is something you need to master. Understanding different types of leads will help you sell your product or services and scale your business faster.',
 'That’s why we’re explaining what warm leads are and how you can easily turn them into paying customers.']
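
If the text can also contain "!" or "?", one possible variation is to split with a regular expression instead of `str.split`. This is only a sketch: the function name `merge_sentences`, its `threshold` default, and the regex are choices made here for illustration, not part of the answer above. The lookbehind keeps the punctuation attached to each sentence, so nothing has to be appended afterwards. Abbreviations such as "e.g." would still be treated as sentence ends.

```python
import re

def merge_sentences(text, threshold=200):
    """Greedily join consecutive sentences while the chunk stays under `threshold` characters."""
    # Split on whitespace that follows ., ! or ?; the lookbehind keeps the punctuation.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks = []
    for sentence in sentences:
        if not sentence:
            continue  # empty input yields a single empty string
        if chunks and len(chunks[-1]) + len(sentence) < threshold:
            chunks[-1] += " " + sentence
        else:
            chunks.append(sentence)
    return chunks

# Made-up demo text with a small threshold so the merging is visible.
demo = "One. Two is here! Three follows after that? Four."
print(merge_sentences(demo, threshold=20))
```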
mozway

Use `.split('.')` to do what you're describing. This describes `.split` for beginners, and this Stack Overflow solution describes some more complex usage.
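
For reference, a short sketch of what a plain split on "." returns. The `text` string here is invented for the example. Note that the periods are removed, the pieces keep their leading spaces, and a trailing empty string appears when the text ends with a period, so some cleanup is usually needed. The pieces are also still uneven in length.

```python
# Hypothetical input, used only to show the raw result of str.split.
text = "First sentence. Second one. Third."

parts = text.split(".")
print(parts)  # ['First sentence', ' Second one', ' Third', '']

# Strip whitespace, drop empty pieces, and put the period back.
cleaned = [p.strip() + "." for p in parts if p.strip()]
print(cleaned)  # ['First sentence.', 'Second one.', 'Third.']
```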

Warlax56