I'm using the "Extract all text from slides in presentation" example at https://python-pptx.readthedocs.io/en/latest/user/quickstart.html to extract text from some PowerPoint slides:
from pptx import Presentation

prs = Presentation(path_to_presentation)

# text_runs will be populated with a list of strings,
# one for each text run in presentation
text_runs = []

for slide in prs.slides:
    for shape in slide.shapes:
        if not shape.has_text_frame:
            continue
        for paragraph in shape.text_frame.paragraphs:
            for run in paragraph.runs:
                text_runs.append(run.text)
It seems to be working fine, except that I'm getting odd splits in some of the text_runs. Text that I'd expect to be grouped together is being split up, with no pattern that I can detect. For example, sometimes the slide title is split into two parts, and sometimes it isn't.
I've discovered that I can eliminate the odd splits by retyping the text on the slide, but that doesn't scale.
I can't, or at least don't want to, merge the two parts of the split text together, because sometimes the second part of the text has been merged with a different text run. For example, on the slide deck's title slide, the title will be split in two, with the second part of the title merged with the title slide's subtitle text.
Any suggestions on how to eliminate the odd / unwanted splits? Or is this behavior more-or-less to be expected when reading text from a PowerPoint?
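One variant I could try, in case it helps frame the question: collecting one string per paragraph instead of one per run. This is only a sketch and rests on an assumption I haven't confirmed, namely that the splits come from a single paragraph being stored as several runs. The function name paragraph_texts is my own and not part of python-pptx; it only relies on the same attributes as the quickstart example (slides, shapes, has_text_frame, text_frame.paragraphs, runs, text).

```python
def paragraph_texts(prs):
    """Return one string per non-empty paragraph in the presentation.

    Assumption: the unwanted splits are run boundaries inside a single
    paragraph, so joining the runs of each paragraph undoes them.
    """
    texts = []
    for slide in prs.slides:
        for shape in slide.shapes:
            if not shape.has_text_frame:
                continue
            for paragraph in shape.text_frame.paragraphs:
                # Join every run in this paragraph into a single string.
                joined = "".join(run.text for run in paragraph.runs)
                if joined:  # skip paragraphs that contain no run text
                    texts.append(joined)
    return texts
```

It would be called as paragraph_texts(Presentation(path_to_presentation)). This would not explain the case where part of the title ends up merged with the subtitle, so it may only address some of the splits.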