
I am looking to perform text replacements in a shape's text. I am using code similar to the snippet below:

# define key/value
SRKeys, SRVals = ['x','y','z'], [1,2,3]

# define text
text = shape.text

# iterate through key/value pairs and perform subs
for key, val in zip(SRKeys, SRVals):
    # replace text
    text = text.replace(key, str(val))

# write text subs to comment box
shape.text = text

However, if the initial shape.text has formatted characters (bolded for example), the formatting is removed on the read. Is there a solution for this?

The only thing I could think of is to iterate over the characters and check for formatting, then add these formats before writing to shape.text.

Michael Berk
  • Sounds like you should change [`runs`](https://python-pptx.readthedocs.io/en/latest/api/text.html#pptx.text.text._Paragraph.runs) instead of the entire text. – Jongware Jan 30 '20 at 23:42
  • Thanks for the help! Unfortunately, formatting is removed on the read ```text = shape.text```. So you can write to a run and format the text, but you cannot "copy" the format of the original text. – Michael Berk Jan 31 '20 at 15:42

2 Answers


@usr2564301 is on the right track. Character formatting (a.k.a. "font") is specified at the run level. That is what a run is: a "run" (sequence) of characters all sharing the same character formatting.

When you assign to shape.text you replace all the runs that used to be there with a single new run having default formatting. If you want to preserve formatting you need to preserve whatever runs are not directly involved in the text replacement.

This is not a trivial problem because there is no guarantee runs break on word boundaries. Try printing out the runs for a few paragraphs and I think you'll see what I mean.
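To see where the run boundaries actually fall, a small helper along these lines can list each run with its starting offset. This is only a sketch: `run_map` is a hypothetical name, and the demo uses stand-in objects that merely mimic `paragraph.runs` and `run.text` so it can be run without a .pptx file. With python-pptx you would pass a real paragraph from `shape.text_frame.paragraphs` instead.

```python
from types import SimpleNamespace


def run_map(paragraph):
    """Return (run_idx, prefix_len, length, text) for each run in a paragraph."""
    rows, prefix_len = [], 0
    for idx, run in enumerate(paragraph.runs):
        rows.append((idx, prefix_len, len(run.text), run.text))
        prefix_len += len(run.text)
    return rows


# Stand-in paragraph (assumption: only .runs and .text are needed here)
demo = SimpleNamespace(runs=[SimpleNamespace(text="Replace "),
                             SimpleNamespace(text="x an"),
                             SimpleNamespace(text="d y please")])

for row in run_map(demo):
    print(row)
```

Note how the word "and" is split across two runs in the stand-in data, which is the kind of break you should expect in real files.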

In rough pseudocode, I think this is the approach you would need to take:

  • Search for the target text in the paragraph to determine the offset of its first character.
  • Traverse all the runs in the paragraph, keeping a running total of how many characters there are before each run, maybe something like (run_idx, prefix_len, length): (0, 0, 8), (1, 8, 4), (2, 12, 9), etc.
  • Identify which runs are the starting, ending, and in-between runs involving your search string.
  • Split the first run at the start of the search term, split the last run at the end of the search term, and delete all but the first of the "middle" runs.
  • Change the text of the remaining middle run to the replacement text and clone the formatting from the prior (original start) run. Maybe this last bit you do at split-start time.

This preserves any runs that do not involve the search string and preserves the formatting of the "matched" word in the "replaced" word.
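A minimal sketch of a simplified variant of the steps above, which is not exactly the split-and-clone procedure. Instead of splitting runs, it writes the replacement into the run where the match starts (so it takes that run's formatting) and trims the matched characters out of any later runs, which can leave empty runs behind. It only assigns to `run.text`, so no XML manipulation is needed. `replace_in_runs` is a hypothetical name, it handles only the first occurrence, and it will not find matches that span a line break inside the paragraph. The demo uses stand-in objects in place of a real python-pptx paragraph.

```python
from types import SimpleNamespace


def replace_in_runs(paragraph, search, replacement):
    """Replace the first occurrence of `search`, editing only the runs it touches."""
    if not search:
        return False
    runs = list(paragraph.runs)
    start = "".join(run.text for run in runs).find(search)
    if start < 0:
        return False
    end = start + len(search)

    offset, first = 0, True
    for run in runs:
        text = run.text
        run_start, run_end = offset, offset + len(text)
        offset = run_end
        # Skip runs that lie entirely before or after the match
        if run_end <= start or run_start >= end:
            continue
        lo = max(start, run_start) - run_start  # match start within this run
        hi = min(end, run_end) - run_start      # match end within this run
        if first:
            # The starting run receives the replacement text
            run.text = text[:lo] + replacement + text[hi:]
            first = False
        else:
            # Later runs just lose their share of the matched characters
            run.text = text[:lo] + text[hi:]
    return True


# Stand-in paragraph: "world" is split across the first two runs
para = SimpleNamespace(runs=[SimpleNamespace(text="Hello wo"),
                             SimpleNamespace(text="rld, "),
                             SimpleNamespace(text="bye")])
changed = replace_in_runs(para, "world", "there")
print(changed, [run.text for run in para.runs])
```

To replace every occurrence you could call it in a loop until it returns False, provided the replacement text does not itself contain the search string.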

This requires a few operations that are not directly supported by the current API. For those you'd need to use lower-level lxml calls to directly manipulate the XML, although you could get hold of all the existing elements you need from python-pptx objects without ever having to parse in the XML yourself.

scanny
  • Thanks for the answer - it worked! I wrote a less efficient solution, but it was very straightforward to code: 1) Iterate through the runs and store each run's text and format. 2) Perform the text replacement on the stored text. 3) Clear the paragraph and write each run back (with its new text and format). Note: because you can't write to run.font, you have to choose which font properties to store (which isn't ideal). – Michael Berk Feb 02 '20 at 21:16

Here is an adapted version of the code I'm using (inspired by @scanny's answer). It replaces text for all shapes (with text frame) on a slide.

from pptx import Presentation

prs = Presentation('../../test.pptx')
slide = prs.slides[1]

# iterate through all shapes on slide
for shape in slide.shapes:
    if not shape.has_text_frame:
        continue
        
    # iterate through paragraphs in shape
    for p in shape.text_frame.paragraphs:
        # store formats and their runs by index (not dict because of duplicate runs)
        formats, newRuns = [], []

        # iterate through runs
        for r in p.runs:
            # get text
            text = r.text

            # replace text
            text = text.replace('s','xyz')

            # store run
            newRuns.append(text)

            # store format
            formats.append({'size':r.font.size,
                            'bold':r.font.bold,
                            'underline':r.font.underline,
                            'italic':r.font.italic})

        # clear paragraph
        p.clear()

        # iterate through new runs and formats and write to paragraph
        for newText, fmt in zip(newRuns, formats):
            # add run with text
            run = p.add_run()
            run.text = newText

            # format run
            run.font.bold = fmt['bold']
            run.font.italic = fmt['italic']
            run.font.size = fmt['size']
            run.font.underline = fmt['underline']

prs.save('../../test.pptx')
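Since the set of stored font properties has to be chosen by hand, one way to keep that choice in a single place is a pair of small helpers like the ones below. This is a sketch, not part of the code above: `FONT_ATTRS`, `snapshot_font` and `apply_font` are hypothetical names, and the attribute list is an assumption you would adjust. Color is left out on purpose because reading it needs extra handling when no color is set. The demo uses stand-in objects in place of `run.font`.

```python
from types import SimpleNamespace

# Assumption: the simple font properties worth carrying over
FONT_ATTRS = ("name", "size", "bold", "italic", "underline")


def snapshot_font(font, attrs=FONT_ATTRS):
    """Store the chosen font properties in a plain dict."""
    return {attr: getattr(font, attr) for attr in attrs}


def apply_font(font, snapshot):
    """Write the stored properties onto another font object."""
    for attr, value in snapshot.items():
        setattr(font, attr, value)


# Stand-ins for run.font on the old run and on the newly added run
old_font = SimpleNamespace(name="Arial", size=None, bold=True,
                           italic=None, underline=False)
new_font = SimpleNamespace(name=None, size=None, bold=None,
                           italic=None, underline=None)

apply_font(new_font, snapshot_font(old_font))
print(new_font)
```

In the loop above this would amount to storing `snapshot_font(r.font)` for each run and calling `apply_font(run.font, ...)` after `p.add_run()`.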
Michael Berk
  • Thanks for sharing. It is not clear why in the line "for _, r in enumerate(p.runs):" enumerate is used when in the loop only r is used. Why can it not be replaced with "for r in p.runs:"? – Ger Sep 16 '21 at 10:07
  • Yeah, not sure why that's there. Changing now. – Michael Berk Sep 16 '21 at 17:46
  • @MichaelBerk - would this replace text in graph headers? meaning x-axis labels of a graph? – The Great Aug 01 '22 at 05:24
  • Good question. I'm not sure how graphs are formatted under the hood, so I'm not sure. Should be easy to check though :) – Michael Berk Aug 02 '22 at 15:53