24

I have an issue where .doc and .pdf files are coming out OK but a .docx file is coming out corrupt.

In order to solve that I am trying to debug why the .docx is corrupt.

I learned that the docx format is much stricter with regard to extra characters than either .pdf or .doc. Therefore I have searched the various xml files WITHIN the docx file looking for invalid XML. But I can't find any. It all validates fine.

[Screenshot: XML files I've been checking out]
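For anyone repeating this check, here is a minimal Python sketch of the "look for invalid XML inside the package" step. It only tests well-formedness, not conformance to the Open XML schemas, and the helper name `find_malformed_xml_parts` is made up for illustration. If the ZIP container itself is damaged, `zipfile.ZipFile` raises `zipfile.BadZipFile` before any part is read, which is a useful clue in its own right.

```python
import zipfile
import xml.etree.ElementTree as ET


def find_malformed_xml_parts(docx_source):
    """Return (part_name, error_message) for every XML part that fails to parse.

    docx_source may be a file path or a binary file-like object.
    """
    problems = []
    with zipfile.ZipFile(docx_source) as package:
        for name in package.namelist():
            # .rels files are XML too, so check them alongside the *.xml parts.
            if not name.lower().endswith((".xml", ".rels")):
                continue
            try:
                ET.fromstring(package.read(name))
            except ET.ParseError as exc:
                problems.append((name, str(exc)))
    return problems
```

An empty list means every XML part parsed, which matches what I saw: the parts themselves were fine.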

Could anyone suggest directions for me to investigate now?

UPDATE:

The full listing of files inside the folder is as follows:

/_rels
    .rels

/customXml
    /_rels
        .rels
    item1.xml
    itemProps1.xml

/docProps
    app.xml
    core.xml

/word
    /_rels
        document.xml.rels
    /media
        image1.jpeg
    /theme
        theme1.xml
    document.xml
    fontTable.xml
    numbering.xml
    settings.xml
    styles.xml
    stylesWithEffects.xml
    webSettings.xml

[Content_Types].xml

UPDATE 2:

I should also have mentioned that the reason for corruption is almost certainly a bad binary file POST on my behalf.

Why are docx files corrupted by binary post, but .doc and .pdf are fine?

UPDATE 3:

I have tried the demo versions of various docx repair tools. They all seem to repair the file OK but give no clue as to the cause of the error.

My next step is to compare the contents of the corrupted file with the repaired version.
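A byte-level comparison can be sketched in a few lines of Python. The function names are hypothetical. One caveat: a repair tool will most likely rewrite the whole archive, so this kind of diff is probably more informative between the file as it was sent and the file as it was received than between the corrupted and repaired copies.

```python
def first_difference(a: bytes, b: bytes):
    """Return the offset of the first differing byte, or None if identical.

    If one input is a prefix of the other, the offset is the shorter length.
    """
    limit = min(len(a), len(b))
    for offset in range(limit):
        if a[offset] != b[offset]:
            return offset
    return None if len(a) == len(b) else limit


def describe_difference(a: bytes, b: bytes, context: int = 8) -> str:
    """Summarise where two byte strings diverge, with a short hex excerpt."""
    offset = first_difference(a, b)
    if offset is None:
        return "identical"
    return (
        f"lengths {len(a)} vs {len(b)}; first difference at offset {offset}: "
        f"{a[offset:offset + context].hex()} vs {b[offset:offset + context].hex()}"
    )
```

Reading both files with `open(path, "rb").read()` and passing the results to `describe_difference` shows whether the damage is a changed byte in the middle or a length mismatch at the end.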

If anybody knows of a docx repair tool that gives a decent error message I'd appreciate hearing about it. In fact I might post that as a separate question.

UPDATE 4 (2017)

I never solved this problem. I have tried all the tools suggested in the answers below but none of them worked for me.

I have since progressed a little further and found a block of 0000 missing when opening the .docx in Sublime Text. More details in the new question here: What could be causing this corruption in .docx files during httpwebrequest?
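One way to test for missing bytes at the end: a ZIP archive finishes with an end-of-central-directory record, 22 bytes long plus an optional comment, and its last field is a two-byte comment length that is `00 00` when there is no comment. Stripping trailing null bytes therefore cuts into that record. The sketch below checks whether the record is complete. `check_zip_tail` is a hypothetical helper, and it naively assumes the last occurrence of the signature really is the end record.

```python
import struct

EOCD_SIGNATURE = b"PK\x05\x06"
EOCD_FIXED_SIZE = 22  # size of the record before the optional comment


def check_zip_tail(data: bytes) -> str:
    """Describe whether the ZIP end-of-central-directory record is complete."""
    start = data.rfind(EOCD_SIGNATURE)
    if start == -1:
        return "no end-of-central-directory signature found"
    available = len(data) - start
    if available < EOCD_FIXED_SIZE:
        missing = EOCD_FIXED_SIZE - available
        return f"end record truncated: {missing} byte(s) missing"
    # The last fixed field is the comment length (little-endian, 16 bits).
    (comment_length,) = struct.unpack_from("<H", data, start + 20)
    expected_end = start + EOCD_FIXED_SIZE + comment_length
    if expected_end > len(data):
        return f"comment truncated: {expected_end - len(data)} byte(s) missing"
    if expected_end < len(data):
        return f"{len(data) - expected_end} unexpected byte(s) after end record"
    return "end record complete"
```

Running this on the bytes before and after the upload should show whether the transfer is shortening the tail of the file.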

Community
Martin Hansen Lennox
  • I take it that your tools don’t come up with a decent error message, do they? Not even somewhere more private, like in the console? – zoul Aug 12 '13 at 18:15
  • what tools would you suggest I use to look into it? I'm a newb at this, only trying to debug the error to solve another issue. When I try to open the file in Word it comes up as corrupt (although it repairs ok). – Martin Hansen Lennox Aug 12 '13 at 18:25
  • Sorry, no idea. I was just hoping you could get a better idea about the error from the tool that’s reporting the file as corrupt. – zoul Aug 12 '13 at 18:35
  • Hmmm I've gone through every xml file in the document and I can't find an xml error. I've found lots of sites that will FIX documents, but none that will show what the problem is. Does anybody know of tools for debugging .docx files? – Martin Hansen Lennox Aug 12 '13 at 19:49
  • According to Mr Google, there *seem* to be several open source tools for repairing docx files. If one of them works for you, then maybe you can get a diagnostic message (or add your own) from it. – andy256 Aug 13 '13 at 05:41
  • Good suggestion Andy, thanks. I am more concerned about finding the cause of the error than fixing the file, but I hadn't considered that those tools might point out the problem. I'll give it a whirl. – Martin Hansen Lennox Aug 13 '13 at 12:01
  • why is your **document.xml** called **documents.xml** ? – edi9999 Aug 13 '13 at 12:44
  • You're correct, that was a typo in the post - now fixed. – Martin Hansen Lennox Aug 13 '13 at 17:22

4 Answers

10

I used the Open XML SDK 2.5 Productivity Tool to find a problem with a broken hyperlink reference.

You have to download/install the SDK first, then the tool. The tool will open and analyze the document for problems.
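If you cannot install the SDK, one class of broken reference can be approximated by hand: a relationship id used in `word/document.xml` that is not declared in `word/_rels/document.xml.rels`. The Python sketch below is only a guess at that kind of check, not what the Productivity Tool does internally. The function name is made up, and the namespace URIs are the ones used by the common (transitional) Open XML format.

```python
import zipfile
import xml.etree.ElementTree as ET

REL_ATTR_NS = "http://schemas.openxmlformats.org/officeDocument/2006/relationships"
PKG_REL_NS = "http://schemas.openxmlformats.org/package/2006/relationships"


def dangling_relationship_ids(docx_source):
    """Return ids used in word/document.xml but missing from its .rels part."""
    with zipfile.ZipFile(docx_source) as package:
        document = ET.fromstring(package.read("word/document.xml"))
        rels = ET.fromstring(package.read("word/_rels/document.xml.rels"))

    declared = {rel.get("Id") for rel in rels.iter(f"{{{PKG_REL_NS}}}Relationship")}
    used = set()
    for element in document.iter():
        for attr_name, value in element.attrib.items():
            # r:id, r:embed, r:link and friends all live in this namespace.
            if attr_name.startswith(f"{{{REL_ATTR_NS}}}"):
                used.add(value)
    return sorted(used - declared)
```

Any id it returns points at a hyperlink, image, or other target that the document refers to but the package never defines.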

bounav
Jeremy K
  • Hi Jeremy, thanks a lot for the suggestion. It was a good one, but when I tried I couldn't get it to open my file. (http://stackoverflow.com/a/18215739/1778169). – Martin Hansen Lennox Jan 24 '14 at 21:48
  • The new location for the productivity tool is: [https://github.com/OfficeDev/Open-XML-SDK/releases/tag/v2.5](https://github.com/OfficeDev/Open-XML-SDK/releases/tag/v2.5) – Tomer Pintel Nov 16 '21 at 08:44
  • Useful tool to explore the XML in a _.docx_ file. Also check out [Blue Eyed Behemoth's answer below](https://stackoverflow.com/a/37930725/2472): validating the file with the `OpenXmlValidator` class should be the first thing to try (because it's built into the SDK you're most likely already using). – bounav Jun 09 '23 at 10:31
7

Many years late, but I found this which actually worked for me. (From https://msdn.microsoft.com/en-us/library/office/bb497334.aspx)

(wordDoc is a WordprocessingDocument)

    using System;
    using DocumentFormat.OpenXml.Validation;

    try
    {
        var validator = new OpenXmlValidator();
        var count = 0;

        // Validate() returns one ValidationErrorInfo per problem it finds.
        foreach (var error in validator.Validate(wordDoc))
        {
            count++;
            Console.WriteLine("Error " + count);
            Console.WriteLine("Description: " + error.Description);
            Console.WriteLine("ErrorType: " + error.ErrorType);
            Console.WriteLine("Node: " + error.Node);
            Console.WriteLine("Path: " + error.Path.XPath);
            Console.WriteLine("Part: " + error.Part.Uri);
            Console.WriteLine("-------------------------------------------");
        }

        Console.WriteLine("count={0}", count);
    }
    catch (Exception ex)
    {
        // A badly damaged package can throw instead of reporting errors.
        Console.WriteLine(ex.Message);
    }
Blue Eyed Behemoth
  • This looked very promising... but trying this with the document I had gives me: `System.IO.FileFormatException: File contains corrupted data.` aaaagggghhhh! – Martin Hansen Lennox Feb 07 '17 at 23:04
  • Make sure it's not a `.doc`, they don't have XML. Only `.docx` does. If you can't open the file, try also switching the extension. You may not be converting the `doc` to a `docx` or something – Blue Eyed Behemoth Feb 10 '17 at 14:08
  • It was a docx, but the corruption ended up being pretty obscure, which is probably why nothing would open it! A few null bytes were being stripped from the end of the file. http://stackoverflow.com/questions/42102359/what-could-be-causing-this-corruption-in-docx-files-during-httpwebrequest – Martin Hansen Lennox Feb 11 '17 at 00:49
6

Usually, when there is an error in a particular XML file, Word tells you which line of which file the error occurs on. So I believe the problem comes from either the zipping of the file or the folder structure.

Here is the folder structure of a Word file:

The .docx format is a zipped file that contains the following folders:

+--docProps
|  +  app.xml
|  \  core.xml
+  res.log
+--word //this folder contains most of the files that control the content of the document
|  +  document.xml //the actual content of the document
|  +  endnotes.xml
|  +  fontTable.xml
|  +  footer1.xml //contains the elements in the footer of the document
|  +  footnotes.xml
|  +--media //this folder contains all images embedded in the document
|  |  \  image1.jpeg
|  +  settings.xml
|  +  styles.xml
|  +  stylesWithEffects.xml
|  +--theme
|  |  \  theme1.xml
|  +  webSettings.xml
|  \--_rels
|     \  document.xml.rels //this file tells Word where the images are located
+  [Content_Types].xml
\--_rels
   \  .rels

It seems that you have only what is inside the word folder, don't you? If this doesn't work, could you please either send the corrupted docx or post the structure of the folders inside your zip?
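Both suspicions (the zipping and the folder structure) can be checked without unzipping by hand. Below is a minimal Python sketch. The helper name is hypothetical, and treating these three parts as required is an assumption made for the sketch: they are simply the ones that appear in every listing in this thread.

```python
import zipfile

# Assumption for this sketch: these parts must be present at these exact paths.
EXPECTED_PARTS = ("[Content_Types].xml", "_rels/.rels", "word/document.xml")


def inspect_package(docx_source):
    """Return (missing_parts, first_bad_member) for a .docx package."""
    with zipfile.ZipFile(docx_source) as package:
        names = set(package.namelist())
        missing = [part for part in EXPECTED_PARTS if part not in names]
        # testzip() re-reads every member and returns the first name whose
        # CRC or header check fails, or None when all members are intact.
        first_bad = package.testzip()
    return missing, first_bad
```

If the parts show up under an extra top-level folder (for example `mydoc/word/document.xml`), they will be reported as missing, which usually means the parent folder was zipped instead of its contents.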

edi9999
  • Sorry, I should have been more explicit. I posted that shot to demonstrate because it contained the majority of the xml files that I checked. I think the structure was as you say. I'm going to double check. I went through every xml file in the folder, thinking I'd find an invalid one... but I didn't. I'll update the post. – Martin Hansen Lennox Aug 13 '13 at 11:50
-4

A web docx validator worked for me: http://ucd.eeonline.org/validator/index.php

user3044482
  • Ta for the suggestion, I hadn't seen this before. But in my case it threw a 500 with my corrupted file – Martin Hansen Lennox Feb 07 '17 at 23:09
  • This is a document layout validator. It doesn't validate the XML structure. As it mentions on the page: **The validation tool will look at font type, font size, graphics, and tables of your document and make recommendations on its accessibility** – Dark Star1 Mar 09 '18 at 10:22
  • Link is offline – wybe Jul 13 '22 at 12:30