python-pptx - How to replace keyword across multiple runs?

Question

I have two presentations (File1.pptx and File2.pptx) that both contain the two lines below:

XX NOV 2021, Time: xx:xx – xx:xx hrs (90mins)
FY21/22 / FY22/23

I want to make the following replacements:

a) NOV 2021 as NOV 2022.

b) FY21/22 / FY22/23 as FY21/22 or FY22/23.

But the problem is my replacement works in File1.pptx but it doesn't work in File2.pptx.

When I printed the run text, I could see that the same text is split into runs differently in the two files.

def replace_text(replacements:dict,shapes:list):
    for shape in shapes:
        for match, replacement in replacements.items():
            if shape.has_text_frame:
                if (shape.text.find(match)) != -1:
                    text_frame = shape.text_frame
                    for paragraph in text_frame.paragraphs:
                        for run in paragraph.runs:
                            cur_text = run.text
                            print(cur_text)
                            print("---")
                            new_text = cur_text.replace(str(match), str(replacement))
                            run.text = new_text

In File1.pptx, the cur_text looks like below (for 1st keyword). So, my replace works (as it contains the keyword that I am looking for)

But in File2.pptx, the cur_text looks like below (for 1st keyword). So, replace doesn't work (because the cur_text doesn't match with my search term)

The same issue happens for 2nd keyword as well which is FY21/22 / FY22/23.

The problem is the split keyword could be in previous or next run from current run (with no pattern). So, we should be able to compare a search term with previous run term (along with current term as well). Then a match can be found (like Nov 2021) and be replaced.

This issue happens for only 10% of my search terms (not for all of them), but it is scary to live with, because if that percentage increases we may have to do a lot of manual work. How do we avoid this and code it correctly?

How do we get/extract/find/identify the word that we are looking for across multiple runs (when they are indeed present) like CTRL+F and replace it with desired keyword?

Any help please?

UPDATE - Incorrect replacements based on matching

Before replacement

After replacement

My replacement keywords can be found below

replacements = { 'How are you?': "I'm fine!",
                'FY21/22':'FY22/23',
                'FY_2021':'FY21/22',
                'FY20/21':'FY21/22',
                'GB2021':'GB2022',
                'GB2020':'GB2022',
                'SEP-2022':'SEP-2023',
                'SEP-2021':'SEP-2022',
                'OCT-2021':'OCT-2022',
                'OCT-2020':'OCT-2021',
                'OCT 2021':'OCT 2022',
                'NOV 2021':'NOV 2022',
                'FY2122':'FY22/23',
                'FY2021':'FY21/22',
                'FY1920':'FY20/21',
                'FY_2122':'FY22/23',
                'FY21/22 / FY22/23':'FY21/22 or FY22/23',
                'F21Y22':'FY22/23',
                'your FY20 POS FCST':'your FY22/23 POS FCST',
                'your FY21/22 POS FCST':'your FY22/23 POS FCST',
                'Q2/FY22/23':'Q2-FY22/23',
                'JAN-22':'JAN-23',
                'solution for FY21/22':'solution for FY22/23',
                'achievement in FY20/21':'achievement in FY21/22',
                'FY19/20':'FY20/21'}

I just realised that you should not rely on a dictionary to specify the match-replacement pairs, because if you take your two replacement defs `'OCT-2020':'OCT-2021'` and `'OCT-2021':'OCT-2022'` you'll get different outcomes depending on which of the two is processed first and which second. At the end `'OCT-2020'` may actually be replaced by `'OCT-2022'`! And before Python 3.7 a dictionary gave you no guarantee about the sequence in which the items are presented. It is better to use a list of tuples (match, replacement). That way you can define the sequence of the replacements explicitly to avoid such problems. — Frank, Aug 07 '22 at 23:09
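
To make the ordering issue concrete, here is a small sketch (the helper name `apply_in_order` is made up for illustration). Since Python 3.7 a dict does keep insertion order, so the outcome depends on the order in which the pairs are written either way; a list of tuples just makes that order explicit.

```python
def apply_in_order(text, replacements):
    """Apply (match, replacement) pairs one after the other, in list order."""
    for match, replacement in replacements:
        text = text.replace(match, replacement)
    return text


# Same two pairs, different order.
oldest_first = [("OCT-2020", "OCT-2021"), ("OCT-2021", "OCT-2022")]
newest_first = [("OCT-2021", "OCT-2022"), ("OCT-2020", "OCT-2021")]

# The first pair produces 'OCT-2021', which the second pair then matches again.
chained = apply_in_order("OCT-2020", oldest_first)

# Processing the newer year first avoids the chain reaction.
intended = apply_in_order("OCT-2020", newest_first)

print(chained)   # OCT-2022
print(intended)  # OCT-2021
```

The same reasoning suggests listing longer, more specific matches such as 'FY21/22 / FY22/23' before shorter ones such as 'FY21/22' that are contained in them.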

Frank · Accepted Answer · 2022-08-12T21:08:16.803

As the python-pptx documentation at https://python-pptx.readthedocs.io/en/latest/api/text.html explains:

a text frame is made up of paragraphs,
a paragraph is made up of runs and specifies a font configuration that is used as the default for its runs, and
a run specifies part of the paragraph's text with a certain font configuration, possibly different from the default font configuration of the paragraph.

All three have an attribute called text:

The text frame's text contains all the text from all its paragraphs, concatenated with the appropriate line feeds between the paragraphs.
The paragraph's text contains the texts of all its runs concatenated into one long string, with a vertical tab character (\v) wherever there was a so-called soft break in any of the runs' text (a soft break is like a line feed but does not terminate the paragraph).
The run's text contains text that is to be rendered with a certain font configuration (font family, font size, italic/bold/underlined, color etc.). It is the lowest level of font configuration for any text.
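
A quick way to see this hierarchy for a given shape is to list every run with its position. The following is only a sketch: `list_runs` is a made-up helper that relies solely on the `paragraphs`, `runs` and `text` attributes described above, and the demo uses stand-in objects with an invented run split instead of a real presentation.

```python
from types import SimpleNamespace


def list_runs(text_frame):
    """Return (paragraph_index, run_index, run_text) for every run in a text frame."""
    return [
        (p_idx, r_idx, run.text)
        for p_idx, paragraph in enumerate(text_frame.paragraphs)
        for r_idx, run in enumerate(paragraph.runs)
    ]


# Stand-in objects that only mimic the attributes used above, so the sketch
# runs without a .pptx file. The split into runs is invented for illustration.
demo_frame = SimpleNamespace(paragraphs=[
    SimpleNamespace(runs=[SimpleNamespace(text="XX NO"),
                          SimpleNamespace(text="V 2021")]),
    SimpleNamespace(runs=[SimpleNamespace(text="FY21/22 / FY22/23")]),
])

for row in list_runs(demo_frame):
    print(row)
```

With python-pptx you would pass `shape.text_frame` of any shape for which `shape.has_text_frame` is true.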

Now if you specify a line of text in a text-frame in a PowerPoint presentation, this text-frame will very likely only have one paragraph and that paragraph will have just one run.

Let's say that line says: Hello there! How are you? What is your name? and is all normal (neither italic nor bold) and in size 10.

Now if you go ahead in PowerPoint and make the questions How are you? What is your name? stand out by making them italic, you will end up with 2 runs in our paragraph:

Hello there! with the default font configuration from the paragraph
How are you? What is your name? with the font configuration specifying the additional italic attribute.

Now imagine we want How are you? to stand out even more by making it bold and italic. We end up with 3 runs:

Hello there! with the default font configuration from the paragraph.
How are you? with the font configuration specifying the BOLD and ITALIC attribute
What is your name? with the font configuration specifying the ITALIC attribute.

One step further, making the are in How are you? bigger. We get 5 runs:

Hello there! with the default font configuration from the paragraph.
How with the font configuration specifying the BOLD and ITALIC attribute
are with the font configuration specifying the BOLD and ITALIC attribute and font size 16
you? with the font configuration specifying the BOLD and ITALIC attribute
What is your name? with the font configuration specifying the ITALIC attribute.

So if you try to replace the How are you? with I'm fine! with the code from your question, you won't succeed, because the text How are you? is actually distributed across 3 runs.

You can go one level higher and look at the paragraph's text. That still says Hello there! How are you? What is your name? since it is the concatenation of all its runs' texts.
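
The effect can be checked on plain strings. This sketch models the five runs from the example as a list of strings; `runs_spanned_by` is a hypothetical helper, not part of python-pptx.

```python
def runs_spanned_by(run_texts, match):
    """Return the indices of the runs touched by the first occurrence of `match`."""
    start = "".join(run_texts).find(match)
    if start == -1:
        return []
    end = start + len(match)
    spanned = []
    offset = 0  # position of the current run within the joined text
    for index, text in enumerate(run_texts):
        # A non-empty run is touched if its character range overlaps [start, end).
        if text and offset < end and offset + len(text) > start:
            spanned.append(index)
        offset += len(text)
    return spanned


run_texts = ["Hello there! ", "How ", "are", " you?", " What is your name?"]
match = "How are you?"

found_in_single_run = any(match in text for text in run_texts)
found_in_paragraph = match in "".join(run_texts)
spanned = runs_spanned_by(run_texts, match)

print(found_in_single_run)  # False
print(found_in_paragraph)   # True
print(spanned)              # [1, 2, 3]
```

This is exactly the situation in which a per-run `str.replace` silently does nothing.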

But if you go ahead and do the replacement on the paragraph's text, it will erase all runs and create one new run with the text Hello there! I'm fine! What is your name?, deleting all the formatting that we applied, including the italics on What is your name?.

Therefore, changing text in a paragraph without affecting formatting of the other text in the paragraph is pretty involved. And even if the text you are looking for has all the same formatting, that is no guarantee for it to be within one run. Because if you - in our example above - make the are smaller again, the 5 runs will very likely remain, the runs 2 to 4 just having the same font configuration now.
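
One way to approach this is to search the joined run texts and then write the replacement back piece by piece into the runs that the match overlaps. The following is a minimal sketch under stated assumptions, not the code of any published package: `replace_across_runs` is a made-up name, it only uses the `runs` and `text` attributes, it ignores soft breaks between runs, and it simply puts any surplus replacement characters into the last overlapping run.

```python
from types import SimpleNamespace


def replace_across_runs(paragraph, match, replacement):
    """Replace every occurrence of `match`, even if it spans several runs.

    Only run.text is changed, so each run keeps its own font configuration.
    Returns the number of replacements made.
    """
    if not match:
        return 0
    count = 0
    search_from = 0
    while True:
        runs = list(paragraph.runs)
        start = "".join(run.text for run in runs).find(match, search_from)
        if start == -1:
            return count
        end = start + len(match)

        # Collect each overlapping run with the matched slice inside that run.
        overlapping = []
        run_start = 0
        for run in runs:
            run_end = run_start + len(run.text)
            lo, hi = max(start, run_start), min(end, run_end)
            if lo < hi:
                overlapping.append((run, lo - run_start, hi - run_start))
            run_start = run_end

        # Hand out the replacement: each run gets as many characters as it
        # lost, and the last overlapping run takes whatever is left over.
        remaining = replacement
        for position, (run, lo, hi) in enumerate(overlapping):
            is_last = position == len(overlapping) - 1
            take = len(remaining) if is_last else hi - lo
            piece, remaining = remaining[:take], remaining[take:]
            run.text = run.text[:lo] + piece + run.text[hi:]

        count += 1
        search_from = start + len(replacement)


# Stand-in paragraph mimicking the five runs of the example above.
demo_paragraph = SimpleNamespace(runs=[
    SimpleNamespace(text=t)
    for t in ["Hello there! ", "How ", "are", " you?", " What is your name?"]
])
replaced_count = replace_across_runs(demo_paragraph, "How are you?", "I'm fine!")
print(replaced_count, [run.text for run in demo_paragraph.runs])
```

In this demo the three middle runs end up holding "I'm ", "fin" and "e!", so the bold/italic and size settings of those runs stay where they were. With python-pptx the function would be called for every paragraph of a text frame.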

Here is the code to produce a test presentation with a text box containing the exact paragraph runs as given in my example above:

from pptx import Presentation
from pptx.dml.color import RGBColor
from pptx.util import Pt

# create presentation with 1 slide ------
prs = Presentation()
slide = prs.slides.add_slide(prs.slide_layouts[5])
textbox_shape = slide.shapes.add_textbox(Pt(200),Pt(200),Pt(30),Pt(240))
text_frame = textbox_shape.text_frame
p = text_frame.paragraphs[0]
font = p.font
font.name = 'Arial'
font.size = Pt(10)
font.bold = False
font.italic = False
font.color.rgb = RGBColor(0,0,0)

run = p.add_run()
run.text = 'Hello there! '

run = p.add_run()
run.text = 'How '
font = run.font
font.italic = True
font.bold = True

run = p.add_run()
run.text = 'are'
font = run.font
font.italic = True
font.bold = True
font.size = Pt(16)

run = p.add_run()
run.text = ' you?'
font = run.font
font.italic = True
font.bold = True

run = p.add_run()
run.text = ' What is your name?'
run.font.italic = True

prs.save('text-01.pptx')

And this is what it looks like, if you open it in PowerPoint:

Now if you install the python code from my GitHub repository at https://github.com/fschaeck/python-pptx-text-replacer by running the command

python -m pip install python-pptx-text-replacer

and after successful installation run the command

python-pptx-text-replacer -m "How are you?" -r "I'm fine!" -i text-01.pptx -o text-02.pptx

the resulting presentation text-02.pptx will look like this:

As you can see, it mapped the replacement string exactly onto the existing font configurations. Thus, if your match and its replacement have the same length, the replacement string will retain the exact format of the match.

But - as an important side-note - if the text-frame has auto-size or fit-frame switched on, even all that work won't save you from screwing up the formatting, if the text after the replacement needs more or less space!

If you have issues with this code, please try the possibly improved version from GitHub first. If your problem remains, use the GitHub issue tracker to report it. The discussion of this question and answer is already getting out of hand. ;-)

Nice and detailed answer. Appreciate it. But I guess I am aware of the text storage in PowerPoint. The problem is with the code. How can we do this programmatically? Could this even be done? Hence, I didn't mark it as the answer yet, as I am looking for help with the code to solve this problem — The Great, Aug 06 '22 at 03:13
@TheGreat I figured! And writing `I don't have the time to program it out right now.` didn't mean I won't find the time. ;-) I added the according code to my answer and hope that fixes your problem with all the presentations. There might still be some edge-cases I didn't take into account, but as far as I can see, it should work. — Frank, Aug 06 '22 at 09:03
Thanks for the detailed answer. Will try it out executing line by line to understand and get back to you for any queries.. marked the answer for such a great effort to write this code. Will try and get back to you — The Great, Aug 06 '22 at 13:30
One idea is, instead of doing this for all keywords in my ppt, maybe I can follow my code in the post (and replace as usual, as it covers 90% of my keywords for now) and pick the non-matched ones and feed them to your code. So, I can reduce the iterations and also keep track of words (from tweaked code of mine) to store words which are across multiple runs. — The Great, Aug 06 '22 at 13:36
I know you talk about certain edge cases. Can I check for an example that you think may not work? For my problem, they are mostly of the same format (but just curious to know under what scenarios your code may not work) because it works for one of my keywords.. — The Great, Aug 06 '22 at 13:43
@TheGreat I have no particular edge cases in mind. I just didn’t test the code very thoroughly. The above example tests just the case, where a match starts at the beginning of a run, crosses three runs and ends exactly with the third one. Even though I think my code covers all the other cases as well (starting/ending in the middle of a run, being exactly one run, etc. pp.) it isn’t tested with all the possible cases. That’s all. But if you want to be sure, you could extend the test ppt I create with the first code block and **make sure** it works for all cases. — Frank, Aug 06 '22 at 14:17
Not sure, why you would want to run your code to cover 90% of the cases and then my code that would cover 100% of the cases… but so as you wish. ;-) If you need a report of matches that cross runs, add a few print statements in my code. That should be a lot easier than having to check the results of two replacement runs… — Frank, Aug 06 '22 at 14:21
Unfortunately, the code failed for lot of cases like above. I updated my post at the top where I wish to replace `Nov 2021` with `Nov 2022` but you can see that your code replaced incorrectly and also modified some not related keywords like `XX`. Out of 20 keywords or so, It failed for 15 plus keywords etc — The Great, Aug 07 '22 at 10:04
Moreover, your code doesn't replace text inside table cells I guess (which my code does). Nonetheless, I appreciate your help. I am leaving the answer as it is. Hope it may be useful for others. but didn't solve my problem. — The Great, Aug 07 '22 at 10:14
`FY2021` becomes `FY21FY`. I don't know why and how it becomes like this. — The Great, Aug 07 '22 at 10:15
@TheGreat Can you send me one of the files that my code doesn't work for? I'd really like to know why. There might be still some PowerPoint specialities that need to be taken care of in a specific way. I could imagine, that there are runs that somehow provide a different number of characters to the paragraph's text field than they themselves contain. That would cause my code to get off track and replace the wrong text. But to fix that I need an example... — Frank, Aug 07 '22 at 16:58
@TheGreat Since this is getting far too involved to do the development of this replacement function here on Stack Overflow, I created a repository at GitHub.com with a much extended script that handles grouped shapes and tables as well. Just download the python-pptx-text-replacer.py from the repository https://github.com/fschaeck/python-pptx-text-replacer and try that - see README.md for usage. And we can use GitHub's issue tracker to get everything fixed for your presentations. — Frank, Aug 07 '22 at 22:53
Unfortunately, I am unable to share the exact ppt as it is confidential. When I manually reproduce it, the runs and paragraph info get modified and I am not able to recreate the old one. So, how do you think we should do this? — The Great, Aug 09 '22 at 02:37
@TheGreat I am going to copy my script from GitHub into this answer. It shows a very detailed analysis of the text-frames, paragraphs and runs with the exact text it finds in those objects and the exact replacements it is doing. Maybe if you run that script against your pptx files, you'll see, where it goes wrong and that could help us fix it. — Frank, Aug 09 '22 at 06:05
I went through the code. I am not a coder though, but if I have to replace keyword "Fy2021" with "Fy2022", I should put them under the 'metavar' placeholder... -m and -r. Am I right? But yes, thanks a lot for the detailed answer. Will try once again and update — The Great, Aug 09 '22 at 16:43
@TheGreat You don't need to change any code! If you take the code from the answer and save it as python-pptx-text-replacer.py - or better download the script file python-pptx-text-replacer.py from the GitHub repository I referenced in the answer (I added changing text in chart categories) - you can simply run `python ./python-pptx-text-replacer.py -m 'Fy2021' -r 'Fy2022' -i '' -o ''` — Frank, Aug 09 '22 at 17:51
If I want to use this in jupyter notebook, I copied all this into the jupyter cell. Now I want to be able to pass a list of keywords with match and replace term. If you see in my code above, I create a replacement dict (we will use list for your code) and put all of them at once (in a jupyter cell) and execute them by clicking `run` and it triggers the code. Is there anyway to do something similar here? Sorry, for being naive. for people like me (non-coders) jupyter notebook is bit straight forward and easy to use. I don't know how to use `-m "How are you?" -r "I'm fine!"` in jupyter notebook — The Great, Aug 10 '22 at 09:45
@TheGreat How is it going? Is the code working for you in 100% of the cases now? I created a Python package python-pptx-text-replacer on PyPI for the code, so it is now installable via pip. Also one question from my side: How did you manage to get python-pptx running in Jupyter? Are you using a private installation? On https://jupyter.org/try-jupyter/lab/ I couldn't even get it installed... — Frank, Aug 13 '22 at 07:49
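
The comments mention text inside table cells and grouped shapes, which a plain loop over the shapes of a slide does not reach. A minimal sketch of how such nested text frames could be collected follows. It assumes the python-pptx attribute names `shapes` (on group shapes), `has_table`, `table.rows`, `cells`, `has_text_frame` and `text_frame`; `iter_text_frames` itself is a hypothetical helper, and the demo uses stand-in objects with strings in place of real text frames.

```python
from types import SimpleNamespace


def iter_text_frames(shapes):
    """Yield every text frame reachable from `shapes`, including nested ones."""
    for shape in shapes:
        children = getattr(shape, "shapes", None)
        if children is not None:
            # Group shape: descend into its member shapes.
            yield from iter_text_frames(children)
        elif getattr(shape, "has_table", False):
            # Table: every cell carries its own text frame.
            for row in shape.table.rows:
                for cell in row.cells:
                    yield cell.text_frame
        elif getattr(shape, "has_text_frame", False):
            yield shape.text_frame


# Stand-ins for a text box, a table, a group and a picture.
text_box = SimpleNamespace(has_text_frame=True, text_frame="text box")
table = SimpleNamespace(has_table=True, table=SimpleNamespace(rows=[
    SimpleNamespace(cells=[SimpleNamespace(text_frame="cell 1"),
                           SimpleNamespace(text_frame="cell 2")]),
]))
group = SimpleNamespace(shapes=[
    SimpleNamespace(has_text_frame=True, text_frame="grouped box"),
])
picture = SimpleNamespace(has_text_frame=False)

collected = list(iter_text_frames([text_box, table, group, picture]))
print(collected)
```

With a real presentation the starting point would be `slide.shapes` for each slide, and each yielded text frame could then be processed paragraph by paragraph.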
