Automated extraction of a Word docx archive and pretty-print conversion of its XML files

Question

If you rename a file such as document.docx to document.docx.unzipped.zip, you can extract that archive, e.g. to a folder 'document.docx.unzipped'. Unfortunately, the extracted XML files are not very readable, because the whole XML content of each file is on one single line.

I would like to automate the process of extracting a docx archive and converting all XML files in the extraction folder (document.docx.unzipped) to readable, pretty-printed versions (like Notepad++ --> Extensions --> XML Tools --> Pretty Print (XML only with line breaks)).

Any ideas for a quick approach?

EDIT 1: modified version of the idea from https://stackoverflow.com/users/1761490/pawel-jasinski

#!/bin/sh


# this script unpacks and reformats docx files
#
# you need xslt processor (Transform) in your path
# /c/Program Files/Saxonica/SaxonHE9.4N/bin/Transform
#
# make sure remove-rsid.xslt and copy.xslt are in the same directory as this script
if [ "$1" = "-r" ]; then
    remove_rsid=1
    shift
fi

if [ "$1" = "" ]; then
    echo expected name of the word document to be exploded
    exit 1
fi
suffix=${1##*.}
name="$1"

if [ "$suffix" = "xml" ]; then
    suffix=docx
    name="${1%.xml}.docx"
fi

if [ "$suffix" = "$1" ]; then
    suffix=docx
    name=$1.docx
fi


corename=$(basename "$name" ".$suffix")
if [ -z "$corename" ]; then
    echo cannot work with empty name
    exit 1
fi

DIR="$( cd "$( dirname "$0" )" && pwd )"
DOSDIR=$(cygpath -m "$DIR")
FLAT=$PWD/$corename.tmp/flat.$$
FLATOUT=$PWD/$corename.tmp/flat.$$.out


if [ "$remove_rsid" = "1" ]; then
    transform=$DOSDIR/remove-rsid.xslt
else
    transform=$DOSDIR/copy.xslt
fi

# $1 - file name
# 
# formats file as xml
_reformat_xml() {
    echo reformat "$1"
    #read pause
    # only replace the original if xmllint succeeded
    xmllint --format "$1" -o "$1.new" && mv "$1.new" "$1"
}

flaten() {
    # move all xml parts into the flat directory; '/' in the path becomes '@'
    # (assumes part names contain neither whitespace nor '@')
    xmls=""
    pwd
    #read pause
    for f in $(find . -name '*.xml'); do
        ff=$(echo "${f#./}" | tr '/' '@')
        echo mv "$f" "$FLAT/$ff"
        mv "$f" "$FLAT/$ff"
        _reformat_xml "$FLAT/$ff"
        xmls="$xmls $ff"
    done

    # for rels, rename into .xml
    rels=""
    for f in $(find . -name '*.rels'); do
        ff=$(echo "${f#./}" | tr '/' '@')
        rels="$rels $ff"
        mv "$f" "$FLAT/$ff.xml"
        _reformat_xml "$FLAT/$ff.xml"
        #read pause
    done
}

expand_dirs() {
    target_dir=$(pwd)
    cd "$FLATOUT"

    echo PWD: "$PWD"
    #read pause

    # move the rels back, dropping the temporary .xml suffix
    for f in $rels ; do
        ff=$(echo "$f" | tr '@' '/')
        mv "$f.xml" "$target_dir/$ff"
    done

    for f in $xmls ; do
        echo PWD: "$PWD"
        #read pause
        ff=$(echo "$f" | tr '@' '/')
        mv "$f" "$target_dir/$ff"
    done
    cd "$target_dir"
}

echo corename: $corename
#read pause
if [ -e "$corename" ]; then
    if [ -e "$corename.bak" ];then
        # echo removing $corename.bak
        rm -rf "$corename.bak"
    fi
    # echo backing up $corename
    mv "$corename" "$corename.bak"
fi 


mkdir "$corename"
cd "$corename"
unzip -q "../$name"

if [ -e "$FLAT" ]; then
    rm -rf "$FLAT"
fi
mkdir -p "$FLAT"

flaten

if [ -e "$FLATOUT" ]; then
    rm -rf "$FLATOUT"
fi
mkdir -p "$FLATOUT"
#exit

#dosflat=$(cygpath -m $FLAT)
#Transform -xsl:$transform -s:$dosflat -o:$dosflat.out
cp -R "$FLAT"/* "$FLATOUT"

expand_dirs

#read pause
rm -rf "$FLAT" "$FLATOUT"

What is your preferred programming language: C#, Java, something else? — JasonPlutext, May 20 '14 at 19:55
I first thought of doing this with a combination of Windows batch and some command-line tool for the pretty printing. On the other hand, I don't like the Windows batch "language" very much, even though I have gained some experience with it recently. — grenix, May 21 '14 at 07:07
So if there is an elegant way to do this in a "real" programming language, I would prefer C# or pure C/C++. — grenix, May 21 '14 at 07:13
Given you mention C#, maybe you use Visual Studio? Are you aware you can drag a docx onto Visual Studio, see a tree of its contents, click on the part of interest, then click the format button to pretty print? You can edit/save, and Visual Studio automatically re-zips. Very useful, imho. Or is your objective something different? — JasonPlutext, May 21 '14 at 09:28
At least in Visual Studio 2010 I can't see anything interesting happen, except that Word is started. My objectives tend to be rather didactic. One (maybe practical) objective I have in mind is comparing the styles, layout, ... of two different documents, manually or with the help of a standard diff tool. Or do you know of something like an "advanced docx property reporter" tool? — grenix, May 21 '14 at 10:55
Oh you need to install the Package Editor power tool - http://visualstudiogallery.msdn.microsoft.com/450a00e3-5a7d-4776-be2c-8aa8cec2a75b — JasonPlutext, May 21 '14 at 12:01
Thanks for the hints :) Oh, I found something which might also be helpful for the pretty-printing part: http://stackoverflow.com/questions/3063020/net-xml-pretty-printer — grenix, May 21 '14 at 16:01

score 1 · Answer 1 · answered May 21 '14 at 11:31

If you have ever used Cygwin: it includes xmllint, which has a --format option. This was my original approach. However, xmllint did not format attributes the way I liked, so I developed my own script. Since Word documents contain a lot of rsid noise, the script has an option to remove it.
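
As a rough illustration of that original xmllint approach (not the script from this answer), the following sketch unpacks a docx and pretty-prints every XML part in place. It assumes unzip and xmllint are on the PATH; the function name unzip_and_format and the '.unzipped' folder suffix are made up for this example.

```shell
#!/bin/sh
# Hypothetical helper, not part of the original script.
# Unpacks "$1" into "$1.unzipped" and pretty-prints each XML part in place.
unzip_and_format() {
    docx=$1
    out="$docx.unzipped"
    rm -rf "$out"
    mkdir -p "$out" || return 1
    unzip -q "$docx" -d "$out" || return 1
    # .rels parts are XML as well, so they are formatted too
    find "$out" -type f \( -name '*.xml' -o -name '*.rels' \) |
    while IFS= read -r part; do
        # keep the original file if xmllint fails
        xmllint --format "$part" -o "$part.new" && mv "$part.new" "$part"
    done
}
```

Usage would be something like: unzip_and_format document.docx. The result is meant for reading and diffing, since the formatting adds indentation whitespace to the files.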

I use the following workflow:

get a Word document, let's say foo.docx
explode-docx -r foo.docx
edit foo.docx - make a small change
explode-docx -r foo.docx
kdiff3 foo foo.bak
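
The -r option in this workflow relies on remove-rsid.xslt, which is not shown here. As a crude stand-in, assuming the rsid values only need to be dropped where they appear as w:rsid* attributes (w:rsidR, w:rsidRPr, ...), a sed filter can strip them before diffing. The name strip_rsid is hypothetical, and this regex-based sketch is not a substitute for a real XSLT transform.

```shell
#!/bin/sh
# Hypothetical filter: an illustration only, not the answer's remove-rsid.xslt.
# Reads XML on stdin and writes it to stdout without w:rsid* attributes.
# Elements such as <w:rsid .../> in settings.xml are left untouched.
strip_rsid() {
    sed -E 's/ w:rsid[A-Za-z]*="[^"]*"//g'
}
```

It could be applied per file, for example: strip_rsid < word/document.xml > document.norsid.xml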

Pawel Jasinski

I actually found a Cygwin installation on my PC, so I got tempted to try this. (See EDIT 1 in the question at the top of the thread.) – grenix May 21 '14 at 15:32
I tried to restore your original approach, without the XSLT processing from the Saxonica package. BTW, I had some problems with the odd character used for the tr command, which I replaced with '@'. Well, I realized too late that the flattening and expanding was only needed for the XSLT processing. Anyhow, thanks :) – grenix May 21 '14 at 15:46
The flattening is only for performance reasons. The XSLT processor is Java-based, so it is expensive to call it in a loop. Saxonica has a batch processing mode, but it only works when all files are in the same directory. – Pawel Jasinski May 22 '14 at 16:39
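
The flattening discussed in these comments boils down to a reversible renaming of part paths. A minimal sketch of that mapping follows; the helper names flat_name and tree_name are hypothetical, and it assumes no part name contains '@'.

```shell
#!/bin/sh
# Hypothetical helpers illustrating the name mapping used for flattening.
# flat_name: ./word/document.xml -> word@document.xml
flat_name() { printf '%s\n' "${1#./}" | tr '/' '@'; }
# tree_name: word@document.xml -> word/document.xml
tree_name() { printf '%s\n' "$1" | tr '@' '/'; }
```

With all parts renamed this way they can sit in one directory for a single batch run, and tree_name restores the original layout afterwards.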
