@usr2564301 is on the right track. Character formatting (a.k.a. "font") is specified at the run level. That is what a run is: a "run" (sequence) of characters all sharing the same character formatting.
When you assign to `shape.text` you replace all the runs that used to be there with a single new run having default formatting. If you want to preserve formatting, you need to preserve whatever runs are not directly involved in the text replacement.
This is not a trivial problem because there is no guarantee runs break on word boundaries. Try printing out the runs for a few paragraphs and I think you'll see what I mean.
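To see how the text is actually chopped up, something like the following small helper might do. It is only a sketch: `describe_runs` is a name made up here, and it relies on nothing more than the `runs` property of a paragraph and the `text` property of each run.

```python
def describe_runs(paragraph):
    """Return (index, text) pairs for each run in a paragraph-like object.

    Only `paragraph.runs` and `run.text` are used, so any object exposing
    those two attributes will work.
    """
    return [(i, run.text) for i, run in enumerate(paragraph.runs)]


# Possible usage with python-pptx ("deck.pptx" is a placeholder file name):
#
#   from pptx import Presentation
#
#   prs = Presentation("deck.pptx")
#   for slide in prs.slides:
#       for shape in slide.shapes:
#           if shape.has_text_frame:
#               for paragraph in shape.text_frame.paragraphs:
#                   print(describe_runs(paragraph))
```

Chances are you'll find words, or even single characters, sitting in their own run for no visible reason.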
In rough pseudocode, I think this is the approach you would need to take:
- Search for the target text in the paragraph to determine the offset of its first character.
- Traverse all the runs in the paragraph, keeping a running total of how many characters there are before each run, maybe something like (run_idx, prefix_len, length): (0, 0, 8), (1, 8, 4), (2, 12, 9), etc.
- Identify which runs are the starting, ending, and in-between runs that involve your search string.
- Split the first run at the start of the search term, split the last run at the end of the search term, and delete all but the first of the "middle" runs.
- Change the text of the remaining middle run to the replacement text and clone the formatting from the prior (original start) run. Perhaps you do this last bit at split-start time.
This preserves any runs that do not involve the search string and preserves the formatting of the "matched" word in the "replaced" word.
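The bookkeeping in the first three steps could be sketched roughly as below. Note that `replace_in_paragraph` is a simplification and not the split-and-clone approach described above: instead of splitting runs, it puts the replacement text into the starting run (so it picks up that run's formatting) and trims the other affected runs, using only `run.text`. The function names are made up for illustration, and it handles only the first occurrence of the target.

```python
from itertools import accumulate


def run_spans(run_texts):
    """Return (run_idx, prefix_len, length) for each run's text."""
    lengths = [len(t) for t in run_texts]
    prefixes = [0] + list(accumulate(lengths))[:-1]
    return [(i, p, n) for i, (p, n) in enumerate(zip(prefixes, lengths))]


def replace_in_paragraph(paragraph, target, replacement):
    """Replace the first occurrence of `target`; return True if replaced.

    `paragraph` is any object with a `runs` sequence whose items have a
    read/write `text` attribute (as python-pptx paragraphs and runs do).
    """
    runs = list(paragraph.runs)
    texts = [r.text for r in runs]
    if not target:
        return False
    start = "".join(texts).find(target)
    if start < 0:
        return False
    end = start + len(target)

    # Runs whose character range overlaps [start, end).
    hit = [(i, p, n) for i, p, n in run_spans(texts) if p < end and p + n > start]
    first_idx, first_prefix, _ = hit[0]
    last_idx, last_prefix, _ = hit[-1]

    head = texts[first_idx][: start - first_prefix]  # text before the match
    tail = texts[last_idx][end - last_prefix:]       # text after the match

    if first_idx == last_idx:
        runs[first_idx].text = head + replacement + tail
    else:
        runs[first_idx].text = head + replacement
        for i in range(first_idx + 1, last_idx):
            runs[i].text = ""  # in-between runs are emptied, not deleted
        runs[last_idx].text = tail
    return True
```

The trade-off is that emptied runs stay in the paragraph, and any text left over in the starting run shares a run with the replacement. If the matched word must keep formatting distinct from its neighbours in that run, you are back to splitting runs at the XML level.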
This requires a few operations that are not directly supported by the current API. For those you'd need to use lower-level `lxml` calls to directly manipulate the XML, although you could get hold of all the existing elements you need from `python-pptx` objects without ever having to parse the XML yourself.
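As one example of such an operation, splitting a run might look something like the sketch below. It works on the `<a:r>` element itself: the element is deep-copied so both halves carry the same `<a:rPr>` formatting, and the text is divided between them. `split_run_element` is a hypothetical helper, not part of any library, and it assumes each run element has an `<a:t>` child. It sticks to element methods that `lxml` and the standard-library ElementTree have in common (`find`, `insert`, iteration).

```python
import copy

A_NS = "http://schemas.openxmlformats.org/drawingml/2006/main"
A_T = "{%s}t" % A_NS  # qualified tag name of the <a:t> text element


def split_run_element(p, r, offset):
    """Split run element `r` (a child of paragraph element `p`) at `offset`.

    `r` keeps the text before `offset`; a clone holding the remaining text
    is inserted directly after it and returned. The clone carries a copy of
    the same <a:rPr>, so both halves keep the original formatting.
    """
    t = r.find(A_T)
    text = t.text or ""
    if not 0 <= offset <= len(text):
        raise ValueError("offset outside run text")

    new_r = copy.deepcopy(r)
    t.text = text[:offset]
    new_r.find(A_T).text = text[offset:]

    # Insert the clone immediately after the original run.
    p.insert(list(p).index(r) + 1, new_r)
    return new_r
```

With `python-pptx` the call would presumably be `split_run_element(paragraph._p, run._r, offset)`. Both `_p` and `_r` are internal attributes, so treat that as an assumption that may break between versions. Deleting a run is then just `p.remove(r)` on the same elements.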