Questions tagged [docx]

.docx is the file extension for files created using the default format of Microsoft Word 2007 or higher. Use this tag when you are working with .docx files programmatically, such as generating a .docx file, extracting data from one, or editing one.

.docx is the file extension for files created using the default format of Microsoft Word 2007 or higher. This is the Microsoft Office Open XML WordprocessingML format, which is based on a zipped collection of eXtensible Markup Language (XML) files. Office Open XML WordprocessingML is mostly standardized in ECMA-376 and ISO/IEC 29500.

Formerly, Microsoft Office used proprietary binary formats (.xls, .doc, .ppt); BIFF (Binary Interchange File Format) is the one used by Excel. Office now uses the OOXML (Office Open XML) format. These files (.xlsx, .xlsm, .docx, .docm, .pptx, .pptm) are zipped XML.

.docx is the new default Word format. It cannot contain any VBA (for security reasons, as stated by Microsoft).
.docm is the new Word format that can store VBA and execute macros.

A .docx file is a zipped archive that contains the following folders and files:

+--docProps
|  +  app.xml
|  \  core.xml
+  res.log
+--word //this folder contains most of the files that control the content of the document
|  +  document.xml //Is the actual content of the document
|  +  endnotes.xml
|  +  fontTable.xml
|  +  footer1.xml //Contains the elements in the footer of the document
|  +  footnotes.xml
|  +--media //This folder contains all images embedded in the Word document
|  |  \  image1.jpeg
|  +  settings.xml
|  +  styles.xml
|  +  stylesWithEffects.xml
|  +--theme
|  |  \  theme1.xml
|  +  webSettings.xml
|  \--_rels
|     \  document.xml.rels //this file tells Word where the images are located
+  [Content_Types].xml
\--_rels
   \  .rels
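
Because the package is an ordinary ZIP archive, its parts can be listed with any ZIP library. The following is a minimal sketch using Python's standard zipfile module. The helper name list_parts and the tiny in-memory archive are made up for illustration, and that archive is not a complete document that Word could open:

```python
import io
import zipfile


def list_parts(docx_file):
    """Return the sorted part names stored in a .docx (ZIP) package."""
    with zipfile.ZipFile(docx_file) as package:
        return sorted(package.namelist())


# Build a tiny stand-in archive in memory (hypothetical content, not a
# complete document) so the sketch runs without a real file on disk.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as package:
    package.writestr("[Content_Types].xml", "<Types/>")
    package.writestr("_rels/.rels", "<Relationships/>")
    package.writestr("word/document.xml", "<document/>")
buffer.seek(0)

parts = list_parts(buffer)
print(parts)
```

With a real file, passing its path (for example list_parts("report.docx"), where the file name is a placeholder) should return names similar to the tree above.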

The main content of a docx file resides in word/document.xml.

The body of a typical word/document.xml looks like this:

<w:body>
  <w:p w:rsidR="001A6335" w:rsidRPr="0059122C" w:rsidRDefault="0059122C" w:rsidP="0059122C">
    <w:r>
      <w:t>Hello </w:t>
    </w:r>
    <w:proofErr w:type="spellStart"/>
    <w:r w:rsidR="008B4316">
      <w:t>W</w:t>
    </w:r>
    <w:proofErr w:type="spellEnd"/>
    <w:r>
      <w:t>orld</w:t>
    </w:r>
    <w:bookmarkStart w:id="0" w:name="_GoBack"/>
    <w:bookmarkEnd w:id="0"/>
  </w:p>
  <w:sectPr w:rsidR="001A6335" w:rsidRPr="0059122C" w:rsidSect="001A6335">
    <w:headerReference w:type="default" r:id="rId7"/>
    <w:footerReference w:type="default" r:id="rId8"/>
    <w:pgSz w:w="12240" w:h="15840"/>
    <w:pgMar w:top="1440" w:right="1800" w:bottom="1440" w:left="1800" w:header="720" w:footer="720" w:gutter="0"/>
    <w:cols w:space="720"/>
    <w:docGrid w:linePitch="360"/>
  </w:sectPr>
</w:body>

The content lives in w:body, which holds the whole document. The body is divided into multiple w:p elements (paragraphs) and ends with a w:sectPr element, which defines the page size, margins, and headers/footers used for that document.
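
The w:pgSz values in the sample are expressed in twentieths of a point (1440 per inch), so 12240 by 15840 corresponds to US Letter. A small sketch with Python's xml.etree.ElementTree shows the conversion. The trimmed XML string and the function name page_size_inches are illustrative only:

```python
import xml.etree.ElementTree as ET

# WordprocessingML main namespace, bound to the "w" prefix in document.xml.
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W}

# Trimmed copy of the section properties from the sample above.
SAMPLE = f"""<w:body xmlns:w="{W}">
  <w:sectPr>
    <w:pgSz w:w="12240" w:h="15840"/>
  </w:sectPr>
</w:body>"""


def page_size_inches(body):
    """Return (width, height) of the section's page in inches."""
    pg_sz = body.find("w:sectPr/w:pgSz", NS)
    # Attributes are namespace-qualified, hence the {uri}name form.
    width = int(pg_sz.get(f"{{{W}}}w")) / 1440
    height = int(pg_sz.get(f"{{{W}}}h")) / 1440
    return width, height


size = page_size_inches(ET.fromstring(SAMPLE))
print(size)  # (8.5, 11.0)
```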

Inside a w:p, there are one or more w:r elements (runs). Each run carries its own formatting (text color, font size, ...) and contains one or more w:t elements (text parts).

As you can see, a simple sentence like Hello World might be split across several runs and w:t elements, which makes templating quite difficult to implement.
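
One common way around the splitting, when only the text is needed, is to concatenate every w:t inside a paragraph. The sketch below does this with Python's xml.etree.ElementTree on a trimmed copy of the sample. The function name paragraph_texts is made up for illustration, and the code only looks at runs that are direct children of a paragraph, so runs nested in other elements (hyperlinks, for example) are ignored:

```python
import xml.etree.ElementTree as ET

# WordprocessingML main namespace, bound to the "w" prefix in document.xml.
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W}

# Trimmed copy of the "Hello World" paragraph from the sample above.
SAMPLE = f"""<w:body xmlns:w="{W}">
  <w:p>
    <w:r><w:t>Hello </w:t></w:r>
    <w:proofErr w:type="spellStart"/>
    <w:r><w:t>W</w:t></w:r>
    <w:proofErr w:type="spellEnd"/>
    <w:r><w:t>orld</w:t></w:r>
  </w:p>
</w:body>"""


def paragraph_texts(body):
    """Return one string per w:p, joining the text of its direct runs."""
    texts = []
    for paragraph in body.iterfind("w:p", NS):
        pieces = [t.text or "" for t in paragraph.iterfind("w:r/w:t", NS)]
        texts.append("".join(pieces))
    return texts


texts = paragraph_texts(ET.fromstring(SAMPLE))
print(texts)  # ['Hello World']
```

Joining the text this way discards the per-run formatting, so it helps with searching and extraction but not with writing replacement text back into the original runs.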

3020 questions
24
votes
2 answers

Is there any way to read .docx file include auto numbering using python-docx

Problem statement: Extract sections from .docx file including autonumbering. I tried python-docx to extract text from .docx file but it excludes the autonumbering. from docx import Document document = Document("wadali.docx") def…
wadali
24
votes
7 answers

Python: Convert PDF to DOC

How can I convert a PDF file to docx? Is there a way of doing this using Python? I've seen some pages that allow the user to upload a PDF and return a DOC file, like PdfToWord. Thanks in advance
AlvaroAV
24
votes
5 answers

Merge multiple word documents into one Open Xml

I have around 10 word documents which I generate using open xml and other stuff. Now I would like to create another word document and one by one I would like to join them into this newly created document. I wish to use open xml, any hint would be…
Incredible
24
votes
4 answers

How can I debug a corrupt docx file?

I have an issue where .doc and .pdf files are coming out OK but a .docx file is coming out corrupt. In order to solve that I am trying to debug why the .docx is corrupt. I learned that the docx format is much stricter with regard to extra…
Martin Hansen Lennox
22
votes
4 answers

Using Vim to edit Microsoft Word files

I've found ViEmu, a vi emulator for Microsoft Word. However, I wanted to use vim to edit DOC or even RTF files. Is this possible? Are there any other formats that preserve page/paragraph layout and are compatible with both Microsoft Word and Vim? I am also…
Kilon
22
votes
5 answers

OpenXML 2 SDK - Word document - Create bulleted list programmatically

Using the OpenXML SDK 2.0 CTP, I am trying to programmatically create a Word document. In my document I have to insert a bulleted list, and some of the elements of the list must be underlined. How can I do this?
kjv
22
votes
4 answers

Apache POI or docx4j for dealing with docx documents

Which do you think is better for reading a docx document as Java objects, and why? In other words, which library supports most of the Word tags?
becks
22
votes
4 answers

Figure sizes with pandoc conversion from markdown to docx

I type a report with Rmarkdown in RStudio. When converting it to HTML with knitr, there is also a markdown file produced by knitr. I convert this file with pandoc as follows: pandoc -f markdown -t docx input.md -o output.docx The output.docx file…
Stéphane Laurent
21
votes
7 answers

DOCX File type in PHP finfo_file is application/zip

Hello, I'm trying to validate an uploaded file's type with the finfo_file function. But when a .docx file is sent, the file type is application/zip instead of application/vnd.openxmlformats-officedocument.wordprocessingml.document. How can I change this…
WooDzu
21
votes
9 answers

Convert Html to Docx in c#

I want to convert an HTML page to docx in C#. How can I do it?
Luis
21
votes
2 answers

PHP Convert Word file to HTML without losing styling and images

Is there an API for converting Word files to HTML without losing the formatting? Can the Google Documents API be used for this? I tried saaspose, but the result is always a server error. Solutions that did not work for me: Converting MS Word…
Herr
20
votes
5 answers

unoconv not working while trying to convert. throws Error: Unable to connect or start own listener. Aborting

I am trying to convert docx to pdf using unoconv, but getting Error: Unable to connect or start own listener. Aborting. when I run unoconv -f pdf 1234.docx. So, there must be some listener. I then started the listener via unoconv --listener. I tried…
20
votes
4 answers

Version-controlling zipped files (docx, odt)

There are formats that are actually zip files in disguise, e.g. docx or odt. If I store them directly in version control, they are handled as binary files. My ideal solution would be have a hook that creates a foo.docx/ directory for each…
Adam Schmideg
20
votes
3 answers

How to setup cell borders with python-docx

I need to set up cell borders in a table with python-docx, but can't find out how to do it. Please help.
Valentin
20
votes
1 answer

Parsing of table from .docx file

I want to parse a table from a .docx file using Python and python-docx into some useful data structure. The .docx file contains only a single table in my case. I've uploaded it so you can have a look. Here's a screenshot:
Sreedhar