
I have the following MS Word document, which contains only two bullet points from separate lists, each enclosed in a one-cell table.

Screenshot of Input Document

How do I use the Word document's underlying document.xml, numbering.xml, and styles.xml, to capture the type of bullet point (i.e., circle or square)? Reading the http://officeopenxml.com docs and other SO posts, I attempted the following to no avail:

  1. With document.xml, retrieve $num_id = w:numPr/w:numId/@w:val and $lvl_id = w:numPr/w:ilvl/@w:val values.

  2. With numbering.xml, using the above $num_id value, retrieve $abs_id = w:num[@w:numId = $num_id]/w:abstractNumId/@w:val, then return the corresponding value: w:abstractNum[@w:abstractNumId = $abs_id]/w:lvl[@w:ilvl = $lvl_id]/w:lvlText/@w:val

    However, this result is not correct, as both return what appears to be a square bullet.

  3. With styles.xml, review the ListParagraph w:style for any additional matching criteria.

    However, no unique identifiers or values appear useful. What am I missing?


See the relevant sections of the XML documents below. Please advise if other sections or documents are relevant.

document.xml

            <w:p w14:paraId="16A4A39D"
                 w14:textId="10E79F44"
                 w:rsidR="00DB3D99"
                 w:rsidRPr="00D6457F"
                 w:rsidRDefault="00DB3D99"
                 w:rsidP="007205D3">
               <w:pPr>
                  <w:pStyle w:val="ListParagraph"/>
                  <w:keepNext/>
                  <w:numPr>
                     <w:ilvl w:val="0"/>
                     <w:numId w:val="5"/>
                  </w:numPr>
                  <w:spacing w:before="80" w:after="80"/>
                  <w:contextualSpacing w:val="0"/>
                  <w:rPr>
                     <w:rFonts w:ascii="Franklin Gothic Book" w:hAnsi="Franklin Gothic Book"/>
                     <w:bCs/>
                     <w:sz w:val="20"/>
                     <w:szCs w:val="20"/>
                  </w:rPr>
               </w:pPr>
               <w:r w:rsidRPr="00DB3D99">
                  <w:rPr>
                     <w:rFonts w:ascii="Franklin Gothic Book" w:hAnsi="Franklin Gothic Book"/>
                     <w:bCs/>
                     <w:sz w:val="20"/>
                     <w:szCs w:val="20"/>
                  </w:rPr>
                  <w:t>Mainstreaming environmental considerations into social and economic decisions at all levels is of vital importance</w:t>
               </w:r>
            </w:p>

 ...
            <w:p w14:paraId="79FEF50C"
                 w14:textId="65464CBE"
                 w:rsidR="009C1A5F"
                 w:rsidRPr="009C1A5F"
                 w:rsidRDefault="009C1A5F"
                 w:rsidP="009C1A5F">
               <w:pPr>
                  <w:pStyle w:val="ListParagraph"/>
                  <w:keepNext/>
                  <w:numPr>
                     <w:ilvl w:val="0"/>
                     <w:numId w:val="9"/>
                  </w:numPr>
                  <w:spacing w:before="80" w:after="80"/>
                  <w:rPr>
                     <w:rFonts w:ascii="Franklin Gothic Book" w:hAnsi="Franklin Gothic Book"/>
                     <w:sz w:val="20"/>
                     <w:szCs w:val="20"/>
                  </w:rPr>
               </w:pPr>
               <w:r w:rsidRPr="009C1A5F">
                  <w:rPr>
                     <w:rFonts w:ascii="Franklin Gothic Book" w:hAnsi="Franklin Gothic Book"/>
                     <w:bCs/>
                     <w:sz w:val="20"/>
                     <w:szCs w:val="20"/>
                  </w:rPr>
                  <w:t>Solutions need to seek an integrated approach that simultaneously address the conservation of the planet’s genetic diversity, species and ecosystems</w:t>
               </w:r>
            </w:p>

numbering.xml

<w:abstractNum w:abstractNumId="0" w15:restartNumberingAfterBreak="0">
      <w:nsid w:val="037970D6"/>
      <w:multiLevelType w:val="hybridMultilevel"/>
      <w:tmpl w:val="98A2E35C"/>
      <w:lvl w:ilvl="0" w:tplc="E7067EF0">
         <w:start w:val="1"/>
         <w:numFmt w:val="bullet"/>
         <w:lvlText w:val=""/>
         <w:lvlJc w:val="left"/>
         <w:pPr>
            <w:ind w:left="360" w:hanging="360"/>
         </w:pPr>
         <w:rPr>
            <w:rFonts w:ascii="Wingdings 2" w:hAnsi="Wingdings 2" w:hint="default"/>
         </w:rPr>
      </w:lvl>
   ...
   </w:abstractNum>
...
   <w:abstractNum w:abstractNumId="8" w15:restartNumberingAfterBreak="0">
      <w:nsid w:val="6DA523B5"/>
      <w:multiLevelType w:val="hybridMultilevel"/>
      <w:tmpl w:val="D0A2943E"/>
      <w:lvl w:ilvl="0" w:tplc="CBCE2CF0">
         <w:start w:val="1"/>
         <w:numFmt w:val="bullet"/>
         <w:lvlText w:val=""/>
         <w:lvlJc w:val="left"/>
         <w:pPr>
            <w:ind w:left="360" w:hanging="360"/>
         </w:pPr>
         <w:rPr>
            <w:rFonts w:ascii="Wingdings 2" w:hAnsi="Wingdings 2" w:hint="default"/>
         </w:rPr>
      </w:lvl>
   ...
   </w:abstractNum>
...
   <w:num w:numId="5" w16cid:durableId="963343858">
      <w:abstractNumId w:val="0"/>
   </w:num>
   ...
   <w:num w:numId="9" w16cid:durableId="324748400">
      <w:abstractNumId w:val="8"/>
   </w:num>

styles.xml

<w:style w:type="paragraph" w:styleId="ListParagraph">
  <w:name w:val="List Paragraph"/>
  <w:basedOn w:val="Normal"/>
  <w:link w:val="ListParagraphChar"/>
  <w:uiPriority w:val="34"/>
  <w:qFormat/>
  <w:rsid w:val="007205D3"/>
  <w:pPr>
     <w:ind w:left="720"/>
     <w:contextualSpacing/>
  </w:pPr>
</w:style>

To show my actual implementation of the XPath: I am attempting an XSLT that transforms document.xml (referencing numbering.xml via document()), run from PowerShell, to output the text and symbol of every bullet point.

style.xsl

<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                              xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">
 <xsl:output encoding="UTF-8" omit-xml-declaration="yes" indent="yes"/>
 <xsl:strip-space elements="*"/>

 <xsl:template match="/">
    <data>
        <xsl:apply-templates select="descendant::w:tbl"/>
    </data>
 </xsl:template>

 <xsl:template match="w:tbl">
    <xsl:apply-templates select="descendant::w:p[descendant::w:t != '']"/>
 </xsl:template>

 <xsl:template match="w:p">
    <xsl:variable name="num_id" select="w:pPr/w:numPr/w:numId/@w:val"/>
    <xsl:variable name="lvl_id" select="w:pPr/w:numPr/w:ilvl/@w:val"/>
    <xsl:variable name="abs_id" select="document('numbering.xml')/w:numbering/
                                         w:num[@w:numId = $num_id]/w:abstractNumId/@w:val" />
    <xsl:variable name="num_val" select="document('numbering.xml')/w:numbering/
                                         w:abstractNum[@w:abstractNumId = $abs_id]/
                                         w:lvl[@w:ilvl = $lvl_id]/w:lvlText/@w:val"/>
    <xsl:variable name="square_bullet"><![CDATA[&#61569;]]></xsl:variable>
    <xsl:variable name="circle_bullet"><![CDATA[&#61603;]]></xsl:variable>
    <row>
        <text>
            <xsl:value-of select="."/>
        </text>
        <symbol>
            <xsl:value-of select="$num_val"/>
        </symbol>
        <type>
            <xsl:choose>
                <xsl:when test="$num_val = $square_bullet">
                    <xsl:text>Checkbox</xsl:text>
                </xsl:when>
                <xsl:when test="$num_val = $circle_bullet">
                    <xsl:text>Radio</xsl:text>
                </xsl:when>
                <xsl:otherwise>Text</xsl:otherwise>
            </xsl:choose>
        </type>
    </row>
 </xsl:template>

 <xsl:template match="text()">
    <xsl:value-of select='normalize-space()'/>
 </xsl:template>
 
</xsl:stylesheet>

transform.ps1

$xslSettings = New-Object System.Xml.Xsl.XsltSettings($true, $false);
$xmlResolver = New-Object System.Xml.XmlUrlResolver;

$xslt = New-Object System.Xml.Xsl.XslCompiledTransform;

$xslt.Load("style.xsl", $xslSettings, $xmlResolver);
$xslt.Transform("document.xml", "output.xml");

output.xml

<data xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">
  <row>
    <text>Mainstreaming environmental considerations into social and economic decisions at all levels is of vital importance</text>
    <symbol></symbol>
    <type>Text</type>
  </row>
  <row>
    <text>Solutions need to seek an integrated approach that simultaneously address the conservation of the planet’s genetic diversity, species and ecosystems</text>
    <symbol></symbol>
    <type>Text</type>
  </row>
</data>
Parfait
  • Are you sure that the character inside both ` ` is the same in Webdings 2? – Siebe Jongebloed Aug 11 '23 at 20:55
  • StackOverflow is not properly rendering that character which is a square symbol with no qmark. These are exact copied snippets from docx XML files after pretty print. In highlighting the bullet points in Word document, the Font bar shows _Wingdings 2_. – Parfait Aug 11 '23 at 22:09
  • I will try to reproduce your usecase, since it seems you are doing everything correctly. – Siebe Jongebloed Aug 11 '23 at 22:32
  • @Parfait AFAIK Word stores Wingdings 2 characters using Unicode private use encodings, i.e. a code that (say) the Windows character map shows as 182 = 0xB6 is mapped to 0xF0B6. That means you probably have to *unmap* them before you can do anything useful with the code. It's possible that symbols with a recognized Unicode code point are not mapped into a private use area, so your code may need to make some decisions based on the code point it finds. – jonsson Aug 12 '23 at 12:30
  • @jonsson, thank you for your note. Is there anything in the docx files to help with that unmapping? I added my actual XSLT implementation to my OP, using the HTML entities from the links in a posted answer, but to no avail. The raw docx files only show a square symbol and not any Unicode value, entity, etc. How does Word.exe know to translate the two private use encodings using these XML files? Did unzipping the docx affect rendering? – Parfait Aug 12 '23 at 20:04
  • @Parfait Think Siebe Jongebloed's modified answer should tell you enough to make some progress – jonsson Aug 13 '23 at 16:43

1 Answer


In your example, the two <w:lvlText w:val=""/> elements look the same when the XML is displayed, but they are not.

The first one holds U+F081.

The second one holds U+F0A3.
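
One way to see these code points without relying on how an editor renders them is to parse numbering.xml and print each w:lvlText value as U+XXXX. The following is a minimal Python sketch, not part of the original XSLT approach; the embedded fragment and the function name `lvl_text_code_points` are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"

# Hypothetical, trimmed-down stand-in for numbering.xml with the two
# private-use characters written as explicit escapes.
NUMBERING_FRAGMENT = (
    f'<w:numbering xmlns:w="{W}">'
    '<w:abstractNum w:abstractNumId="0"><w:lvl w:ilvl="0">'
    '<w:lvlText w:val="\uf081"/></w:lvl></w:abstractNum>'
    '<w:abstractNum w:abstractNumId="8"><w:lvl w:ilvl="0">'
    '<w:lvlText w:val="\uf0a3"/></w:lvl></w:abstractNum>'
    '</w:numbering>'
)


def lvl_text_code_points(xml_text):
    """Return, for each w:lvlText, its characters as 'U+XXXX' strings."""
    root = ET.fromstring(xml_text)
    result = []
    for lvl_text in root.iter(f"{{{W}}}lvlText"):
        value = lvl_text.get(f"{{{W}}}val", "")
        result.append([f"U+{ord(ch):04X}" for ch in value])
    return result


print(lvl_text_code_points(NUMBERING_FRAGMENT))
```

With a real file, the same function could be fed the text of numbering.xml read as UTF-8.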

If I put both, together with <w:rFonts w:ascii="Wingdings 2" w:hAnsi="Wingdings 2" w:hint="default"/>, in the appropriate w:abstractNum/w:lvl/w:rPr, I get your result as well.

So, to conclude: in the Wingdings 2 font, the characters U+F081 and U+F0A3 point to an open circle and an open square.

So your XPath strategy for reaching these characters is correct.
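
The same numId → abstractNumId → lvl chain from the question can be sketched outside XSLT. The Python below is an illustrative assumption rather than the asker's implementation: `resolve_bullet`, `make_paragraph`, and the inline sample XML are hypothetical names and data, and the sketch ignores w:lvlOverride and numbering inherited from styles.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W}


def w_attr(element, name):
    """Read a w:-namespaced attribute such as w:val."""
    return element.get(f"{{{W}}}{name}")


def resolve_bullet(paragraph, numbering):
    """Return (font, [code points]) for a list paragraph, or None."""
    num_id_el = paragraph.find("w:pPr/w:numPr/w:numId", NS)
    if num_id_el is None:
        return None
    ilvl_el = paragraph.find("w:pPr/w:numPr/w:ilvl", NS)
    num_id = w_attr(num_id_el, "val")
    ilvl = w_attr(ilvl_el, "val") if ilvl_el is not None else "0"

    # w:num maps the paragraph's numId to an abstract definition.
    abs_id = None
    for num in numbering.findall("w:num", NS):
        if w_attr(num, "numId") == num_id:
            abs_id = w_attr(num.find("w:abstractNumId", NS), "val")
            break
    if abs_id is None:
        return None

    # The abstract definition holds the bullet text and font per level.
    for abstract in numbering.findall("w:abstractNum", NS):
        if w_attr(abstract, "abstractNumId") != abs_id:
            continue
        for lvl in abstract.findall("w:lvl", NS):
            if w_attr(lvl, "ilvl") == ilvl:
                lvl_text = lvl.find("w:lvlText", NS)
                text = (w_attr(lvl_text, "val") or "") if lvl_text is not None else ""
                fonts = lvl.find("w:rPr/w:rFonts", NS)
                font = w_attr(fonts, "ascii") if fonts is not None else None
                return font, [ord(ch) for ch in text]
    return None


def make_lvl(abs_id, char):
    """Build a hypothetical one-level w:abstractNum for the demo."""
    return (f'<w:abstractNum w:abstractNumId="{abs_id}"><w:lvl w:ilvl="0">'
            f'<w:lvlText w:val="{char}"/><w:rPr>'
            '<w:rFonts w:ascii="Wingdings 2" w:hAnsi="Wingdings 2"/>'
            '</w:rPr></w:lvl></w:abstractNum>')


def make_paragraph(num_id):
    """Build a hypothetical list paragraph that references num_id."""
    return ET.fromstring(
        f'<w:p xmlns:w="{W}"><w:pPr><w:numPr><w:ilvl w:val="0"/>'
        f'<w:numId w:val="{num_id}"/></w:numPr></w:pPr></w:p>')


NUMBERING = ET.fromstring(
    f'<w:numbering xmlns:w="{W}">'
    + make_lvl("0", "\uf081") + make_lvl("8", "\uf0a3")
    + '<w:num w:numId="5"><w:abstractNumId w:val="0"/></w:num>'
    + '<w:num w:numId="9"><w:abstractNumId w:val="8"/></w:num>'
    + '</w:numbering>')

for num_id in ("5", "9"):
    print(num_id, resolve_bullet(make_paragraph(num_id), NUMBERING))
```

Returning the font together with the numeric code points means the two lists can be told apart even though the raw characters look identical in most editors.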

EDIT: These special characters may appear in the XML as some form of rectangular shape, but that is just a way of displaying undisplayable characters.
In an editor such as BBEdit (on macOS), you have the option to view the bytes of a text file as hex codes. In this way you are able to view the private-use Unicode code points. See, for example, this question for some more info on the way Windows handles private-use Unicode characters.
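
For readers without a hex-capable editor, the bytes can also be listed with a few lines of Python. This is a small sketch assuming the file is UTF-8 encoded, as docx parts normally are; `utf8_hex` is a hypothetical helper name.

```python
def utf8_hex(text):
    """Return the UTF-8 bytes of text as space-separated uppercase hex."""
    return " ".join(f"{byte:02X}" for byte in text.encode("utf-8"))


# The two private-use characters named in the answer.
print(utf8_hex("\uf081"))  # EF 82 81
print(utf8_hex("\uf0a3"))  # EF 82 A3
```

In a hex view of numbering.xml, these three-byte sequences are what sit between the quotes of each w:lvlText w:val attribute.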

I don't know if it is possible to display those actual bullets inside XML using XSLT, since it is a combination between a font and unicode. I suppose you would need to format it using e.g. CSS to actually show it correctly.
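
Since the glyph depends on both the font and the code, one option is to key any comparison on that pair instead of on the rendered symbol. The sketch below follows the comment that a font code such as 0xB6 is stored as 0xF0B6; the range 0xF020 to 0xF0FF and the names `unmap_symbol_code` and `bullet_key` are assumptions for illustration, not something taken from the docx files.

```python
def unmap_symbol_code(code_point):
    """Map a private-use code such as 0xF0B6 back to the font code 0xB6.

    Assumption: symbol-font characters sit in 0xF020..0xF0FF; any other
    code point is returned unchanged.
    """
    if 0xF020 <= code_point <= 0xF0FF:
        return code_point - 0xF000
    return code_point


def bullet_key(font, lvl_text):
    """Build a (font, codes) key identifying a bullet independent of rendering."""
    return font, tuple(unmap_symbol_code(ord(ch)) for ch in lvl_text)


print(bullet_key("Wingdings 2", "\uf081"))
print(bullet_key("Wingdings 2", "\uf0a3"))
```

The resulting codes (0x81 and 0xA3 here) can then be looked up in a character map for the named font to confirm which shape each one draws.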

Siebe Jongebloed
  • Thank you for your answer! While informative, I am not sure how to use it for my use case. (Curiously, how did you find those private use unicodes? Is it made available inside the docx files?) I added to my OP, my actual implementation with XSLT using the html entities your links show but to no avail since numbering.xml shows the visual square symbol and not entity value. – Parfait Aug 12 '23 at 19:57
  • I will give some more info in my answer. – Siebe Jongebloed Aug 13 '23 at 08:31
  • That was it: *combination between a font and unicode*! In Notepad++, I changed the font in numbering.xml temporarily to *Wingdings 2* and actually saw the open circle and open square symbols which I copied over to the XSLT variables to render output as expected! Thanks for your help! – Parfait Aug 13 '23 at 19:17