
I'm trying to replace words in a docx file as described here:

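using System.IO;
using System.Text.RegularExpressions;
using DocumentFormat.OpenXml.Packaging;

public static void SearchAndReplace(string document)
{
    // Open the document for editing (second argument: isEditable)
    using (WordprocessingDocument wordDoc = WordprocessingDocument.Open(document, true))
    {
        // Read the raw XML of the main document part into a string
        string docText = null;
        using (StreamReader sr = new StreamReader(wordDoc.MainDocumentPart.GetStream()))
        {
            docText = sr.ReadToEnd();
        }

        // Replace on the raw XML; this only matches text that sits inside a single <w:t>
        Regex regexText = new Regex("Hello world!");
        docText = regexText.Replace(docText, "Hi Everyone!");

        // Overwrite the part with the modified XML
        using (StreamWriter sw = new StreamWriter(wordDoc.MainDocumentPart.GetStream(FileMode.Create)))
        {
            sw.Write(docText);
        }
    }
}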

That works fine, except that sometimes the text SomeTest in a document ends up split across several runs in the XML, something like:

    <w:t>
        Some
    </w:t>
</w:r>

<w:r w:rsidR="009E5AFA">
    <w:rPr>
        <w:b/>
        <w:color w:val="365F91"/>
        <w:sz w:val="22"/>
    </w:rPr>
    <w:t>
        Test
    </w:t>
</w:r>

And of course the replacement fails, because the search string never appears contiguously in the XML. Is there a workaround to make some words unbreakable in docx? Or am I doing the replacement wrong?

ren

1 Answer


One way to solve this is to normalize the XML of your document before doing any transformations. You can use Open XML PowerTools to do this.

Sample code to normalize the XML:

using DocumentFormat.OpenXml.Packaging;
using OpenXmlPowerTools;

using (WordprocessingDocument doc =
    WordprocessingDocument.Open("Test.docx", true))
{
    SimplifyMarkupSettings settings = new SimplifyMarkupSettings
    {
        NormalizeXml = true, // Merges runs in a paragraph with similar formatting
        // Additional settings if required; each one removes that content from the document
        AcceptRevisions = true,
        RemoveBookmarks = true,
        RemoveComments = true,
        RemoveGoBackBookmark = true,
        RemoveWebHidden = true,
        RemoveContentControls = true,
        RemoveEndAndFootNotes = true,
        RemoveFieldCodes = true,
        RemoveLastRenderedPageBreak = true,
        RemovePermissions = true,
        RemoveProof = true,
        RemoveRsidInfo = true,
        RemoveSmartTags = true,
        RemoveSoftHyphens = true,
        ReplaceTabsWithSpaces = true
    };
    MarkupSimplifier.SimplifyMarkup(doc, settings);
}

This simplifies the markup of the Open XML document, which makes further programmatic transformations easier. I always run it before working with an Open XML document programmatically.
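To illustrate why merged runs matter, here is a minimal sketch that uses only `System.Xml.Linq` (no PowerTools). It shows that a search on the raw XML misses text split across runs, while the concatenated text of the paragraph's `w:t` elements contains it. `RunTextDemo`, `ParagraphText` and `ReplaceAcrossRuns` are hypothetical names made up for this example, and the replace helper is a deliberate simplification: it puts all the text into the first run, so the formatting of later runs is lost. It is not what `MarkupSimplifier` does internally.

```csharp
using System;
using System.Linq;
using System.Xml.Linq;

public static class RunTextDemo
{
    // WordprocessingML main namespace.
    static readonly XNamespace W =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // Concatenates the text of every w:t element in a paragraph, in document order.
    public static string ParagraphText(XElement paragraph) =>
        string.Concat(paragraph.Descendants(W + "t").Select(t => t.Value));

    // Hypothetical helper (assumption, for illustration only): replaces `search`
    // in the paragraph's combined text, writes the result into the first w:t and
    // empties the remaining ones. Returns true if a replacement was made.
    public static bool ReplaceAcrossRuns(XElement paragraph, string search, string replacement)
    {
        var texts = paragraph.Descendants(W + "t").ToList();
        string combined = string.Concat(texts.Select(t => t.Value));
        if (texts.Count == 0 || !combined.Contains(search)) return false;

        texts[0].Value = combined.Replace(search, replacement);
        foreach (var t in texts.Skip(1)) t.Value = string.Empty;
        return true;
    }

    public static void Main()
    {
        // Same shape as the markup in the question: "SomeTest" split over two runs.
        var p = XElement.Parse(
            "<w:p xmlns:w='http://schemas.openxmlformats.org/wordprocessingml/2006/main'>" +
            "<w:r><w:t>Some</w:t></w:r>" +
            "<w:r><w:rPr><w:b/></w:rPr><w:t>Test</w:t></w:r>" +
            "</w:p>");

        Console.WriteLine(p.ToString().Contains("SomeTest")); // False: tags sit between the two halves
        Console.WriteLine(ParagraphText(p));                  // SomeTest
        Console.WriteLine(ReplaceAcrossRuns(p, "SomeTest", "Replaced")); // True
        Console.WriteLine(ParagraphText(p));                  // Replaced
    }
}
```

When the two runs have different formatting, as in the question, keeping each run's formatting needs more care than this sketch takes, so compare the resulting XML before relying on it.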

More info about using these tools can be found here, and there is a good blog article here.

Flowerking
  • So, if I only use NormalizeXml = true, then replace stuff and write it back, it shouldn't change the way the doc looks? – ren Apr 03 '13 at 16:03
  • It works on the Open XML markup and is not going to change anything in terms of document output; the end document still looks the same. But you need to be aware of the changes the other settings make, e.g. if you use `RemoveBookmarks=true` you will end up with a document without bookmarks. Normalizing the XML on its own won't change anything visible in the document; it only normalizes and concatenates the runs within a paragraph. Compare the XML of both documents to see if it meets your requirements. – Flowerking Apr 03 '13 at 16:47