Questions tagged [docx]

.docx is the file extension for files created using the default format of Microsoft Word 2007 or later. Use this tag when you are working with .docx files programmatically, such as generating a .docx file, extracting data from one, or editing one.

.docx is the file extension for files created using the default format of Microsoft Word 2007 or later. This is the Microsoft Office Open XML WordprocessingML format, which is a zipped collection of eXtensible Markup Language (XML) files. Office Open XML WordprocessingML is mostly standardized in ECMA-376 and ISO/IEC 29500.

Formerly, Microsoft Office used proprietary binary formats (.xls, .doc, .ppt); BIFF (Binary Interchange File Format) is the one used by Excel. Office now uses the OOXML (Office Open XML) format. These files (.xlsx, .xlsm, .docx, .docm, .pptx, .pptm) are zipped XML.

.docx is the new default Word format; it cannot contain any VBA (for security reasons, as stated by Microsoft).
.docm is the new Word format that can store VBA and execute macros.

A .docx file is a ZIP archive that contains the following folders and files:

+--docProps
|  +  app.xml
|  \  core.xml
+  res.log
+--word //This folder contains most of the files that control the content of the document
|  +  document.xml //The actual content of the document
|  +  endnotes.xml
|  +  fontTable.xml
|  +  footer1.xml //Contains the elements in the footer of the document
|  +  footnotes.xml
|  +--media //This folder contains all images embedded in the document
|  |  \  image1.jpeg
|  +  settings.xml
|  +  styles.xml
|  +  stylesWithEffects.xml
|  +--theme
|  |  \  theme1.xml
|  +  webSettings.xml
|  \--_rels
|     \  document.xml.rels //This file tells Word where related parts, such as images, are located
+  [Content_Types].xml
\--_rels
   \  .rels
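
The layout above can be inspected with any ZIP library. Here is a minimal sketch using Python's standard zipfile module. To stay self-contained it builds a tiny stand-in package in memory; the helper names build_minimal_package and list_parts are hypothetical, and the stand-in is not a complete Word document.

```python
import io
import zipfile


def build_minimal_package():
    """Build a tiny stand-in package in memory (illustrative, not a valid Word file)."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
        zf.writestr("[Content_Types].xml", "<Types/>")
        zf.writestr("_rels/.rels", "<Relationships/>")
        zf.writestr("word/document.xml", "<document/>")
        zf.writestr("word/_rels/document.xml.rels", "<Relationships/>")
    return buf.getvalue()


def list_parts(package_bytes):
    """Return the sorted names of all entries in the ZIP package."""
    with zipfile.ZipFile(io.BytesIO(package_bytes)) as zf:
        return sorted(zf.namelist())


if __name__ == "__main__":
    for name in list_parts(build_minimal_package()):
        print(name)
```

With a real document, zipfile.ZipFile("some.docx").namelist() gives the same kind of listing, and zf.read("word/document.xml") returns the main part as bytes.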

The main content of a docx file resides in word/document.xml.

A typical word/document.xml looks like this:

<w:body>
  <w:p w:rsidR="001A6335" w:rsidRPr="0059122C" w:rsidRDefault="0059122C" w:rsidP="0059122C">
    <w:r>
      <w:t>Hello </w:t>
    </w:r>
    <w:proofErr w:type="spellStart"/>
    <w:r w:rsidR="008B4316">
      <w:t>W</w:t>
    </w:r>
    <w:proofErr w:type="spellEnd"/>
    <w:r>
      <w:t>orld</w:t>
    </w:r>
    <w:bookmarkStart w:id="0" w:name="_GoBack"/>
    <w:bookmarkEnd w:id="0"/>
  </w:p>
  <w:sectPr w:rsidR="001A6335" w:rsidRPr="0059122C" w:rsidSect="001A6335">
    <w:headerReference w:type="default" r:id="rId7"/>
    <w:footerReference w:type="default" r:id="rId8"/>
    <w:pgSz w:w="12240" w:h="15840"/>
    <w:pgMar w:top="1440" w:right="1800" w:bottom="1440" w:left="1800" w:header="720" w:footer="720" w:gutter="0"/>
    <w:cols w:space="720"/>
    <w:docGrid w:linePitch="360"/>
  </w:sectPr>
</w:body>

The content lives in w:body, which holds a sequence of w:p elements (paragraphs) followed by a w:sectPr element. The w:sectPr defines the section properties: page size, margins, and the headers and footers used for that section.
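
As an illustration, the section properties can be read with a standard XML parser. This is a minimal sketch, assuming the transitional WordprocessingML namespace and a trimmed-down sample with a w:document root added; the function name section_summary is hypothetical. Page dimensions in w:pgSz are expressed in twentieths of a point, so 1440 units make one inch.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
R = "http://schemas.openxmlformats.org/officeDocument/2006/relationships"

# Trimmed-down sample; a real document.xml declares more namespaces.
SAMPLE = f"""<w:document xmlns:w="{W}" xmlns:r="{R}">
  <w:body>
    <w:p><w:r><w:t>Hello</w:t></w:r></w:p>
    <w:sectPr>
      <w:headerReference w:type="default" r:id="rId7"/>
      <w:footerReference w:type="default" r:id="rId8"/>
      <w:pgSz w:w="12240" w:h="15840"/>
    </w:sectPr>
  </w:body>
</w:document>"""


def section_summary(document_xml):
    """Return page size in inches and header/footer relationship ids."""
    root = ET.fromstring(document_xml)
    sect = root.find(f"{{{W}}}body/{{{W}}}sectPr")
    page = sect.find(f"{{{W}}}pgSz")
    return {
        # Attribute values are twentieths of a point: 1440 per inch.
        "width_in": int(page.get(f"{{{W}}}w")) / 1440,
        "height_in": int(page.get(f"{{{W}}}h")) / 1440,
        "header_ids": [e.get(f"{{{R}}}id") for e in sect.findall(f"{{{W}}}headerReference")],
        "footer_ids": [e.get(f"{{{R}}}id") for e in sect.findall(f"{{{W}}}footerReference")],
    }


if __name__ == "__main__":
    print(section_summary(SAMPLE))
```

The relationship ids (rId7, rId8) are resolved through word/_rels/document.xml.rels to the actual header and footer parts.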

Inside a w:p there are one or more w:r elements (runs). Each run can carry its own formatting (text color, font size, ...) in a w:rPr child, and holds its text in one or more w:t elements.

As you can see, a simple sentence like Hello World might be split across several runs, each with its own w:t, which makes templating quite difficult to implement: a placeholder is not guaranteed to sit in a single text node.
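
One common way to cope with this when reading a document is to concatenate every w:t inside a paragraph. Below is a minimal sketch, assuming the transitional WordprocessingML namespace and a reduced version of the sample above wrapped in a w:document root; the function name paragraph_texts is hypothetical. It ignores tabs, line breaks, and other non-text run content.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W}

# Reduced version of the sample above: "Hello World" split across three runs.
SAMPLE = f"""<w:document xmlns:w="{W}"><w:body>
  <w:p>
    <w:r><w:t xml:space="preserve">Hello </w:t></w:r>
    <w:proofErr w:type="spellStart"/>
    <w:r><w:t>W</w:t></w:r>
    <w:proofErr w:type="spellEnd"/>
    <w:r><w:t>orld</w:t></w:r>
  </w:p>
</w:body></w:document>"""


def paragraph_texts(document_xml):
    """Return one string per paragraph, joining the text of all its runs."""
    root = ET.fromstring(document_xml)
    texts = []
    for paragraph in root.iterfind(".//w:p", NS):
        pieces = (t.text or "" for t in paragraph.iterfind(".//w:t", NS))
        texts.append("".join(pieces))
    return texts


if __name__ == "__main__":
    print(paragraph_texts(SAMPLE))
```

Searching the joined string finds a placeholder even when Word has split it across runs; writing a replacement back is harder, because the new text must be assigned to specific runs without losing their formatting.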

3020 questions
1 vote, 1 answer

Apache POI Word .DOC Replacing Text

I would like to open a .doc file, search for some text, and replace it with other text. I know of the RANGE.replaceText(placeholder, newString) method, but it is unreliable when you have merge fields or other special formatting in the document and can…
user2020457
1 vote, 3 answers

PDF compression How does Adobe do it?

This is a bit more of a fun question than a serious one, but how does the Adobe PDF format make documents so... portable? I just created a small Word document, 235kb in size, containing multiple color photos and a few textual phrases. A PDF…
NickSentowski
1 vote, 1 answer

Java DOCX file Viewer

Currently I'm developing an application that allows users to create a template and generate it into a DOCX file. The application needs to be able to display to users the changes in the template as the user is creating it. The approach I tried was…
1 vote, 1 answer

POI docx paragraph outline parsing

I have a very simple issue that is driving me crazy. Basically I want to extract, via POI/DOCX4J libraries, docx paragraph structure and document outline. I did the same task with a normal doc document using the POI paragraph.getLvl() method. Is…
YoBre
1 vote, 2 answers

How generate docx/odt file with math formulas from java

Good day. I must generate a docx or odt file with many math formulas inside. I tried to find a solution in Apache POI and ODF Toolkit but was not able to. Google doesn't help. Maybe somebody can help me with a solution for this task (any example)? Thanks.
1 vote, 2 answers

PHP xPath docx parsing

I am trying to open a Word 2007 document (docx). I unzip it successfully, but I am having an issue with the xPath portion of the code. I want to iterate over each element and grab the text within the element. In the current example below I am trying…
Anderson
1 vote, 0 answers

Phpdocx word documents are corrupt when adding images

I'm using Phpdocx 2.5 to convert html to docx. I'm using the embedHTML method with 'downloadImages' parameter set to true; When the html doesn't contain any images, document is generated just fine. When images are added, the resulting document…
Biggie Mac
1 vote, 1 answer

How to add item transform to VS2012 .proj msbuild file

Based off this answer describing an item transform to convert image files from jpg to png, I made an item transform that converts .docx file to .pdf. When I call it from my projectname.proj build file I get this error message: Error 1 The…
Pauli Price
1 vote, 1 answer

Export from Java EE + Struts2 to DOC files

Does anyone know a Java library that lets me export information to doc format? I would appreciate a variety of options. My project uses Java EE and Struts2, so I need to evaluate and compare the options, for example JasperReports.
1 vote, 1 answer

OpenTBS Multiple pages of repeated template containing table

Alright, I'm new to XML and OpenTBS so this concept of blocks etc is very confusing for me, and when I thought I had the gist of it, my client asked for even more of me. I've got a table of customers and their items, the client wants one single…
PwnageAtPwn
1 vote, 1 answer

Get Xml Text node ID

I'm trying to parse through the document.xml file of a .docx file. I would like to search for Text and then return the node that text is located so I can then move up to the parent node and insert a new node type. This is what I have so far, I have…
user1704863
1 vote, 0 answers

Python sends corrupt .docx as email attachment (google app engine)

I want to send an email from python with: thedoc = generate_doc() mail.send_mail(sender="Support", to="user@mail.co.uk", subject="RE: ref", attachments=('thedoc.docx', thedoc), body="""Blah…
Awalias
1 vote, 0 answers

Does docx4j convert xhtml to docx in memory?

I'm trying to convert xhtml file to docx and find following example code: wordMLPackage.getMainDocumentPart().getContent().addAll(XHTMLImporter.convert(new File(inputfilepath), null, wordMLPackage) ); wordMLPackage.save(new…
1 vote, 2 answers

.net program to parse .doc file

I want to create an application that can parse doc/docx files. The structure of the file is shown below: par-000.01 - some content par-000.21 - some content par-000.31 - some content par-001.32 - some content content could be multi line…
Mithrand1r
1 vote, 0 answers

How to convert DocX document to Microsoft.Office.Interop.Word.Document?

I want to convert or typecast an existing DocX Word document to Microsoft.Office.Interop.Word.Document. static DocX g_document; .... .... function DoSomething() { g_document = DocX.Load(@"C:\Users\RetailWrite.docx"); …
Newton Sheikh