Retaining paragraph numbering in docx using the R officer package

Question

How can I retain the numbering of paragraphs when extracting text from a docx file?

I'm doing some NLP-ML work on a bunch of docx files, and to begin with I need to break up each doc into a dataframe. I'm working with contracts, so almost every paragraph is numbered, e.g. most of the text I'm dealing with looks like this:

1.17. The Agent will provide the attendant resources as set out in Annex 3 bla bla bla

1.17.1. The Agent will ensure that attendant resources are bla bla bla

1.18. An indicative Authority resource profile is set out in bla bla bla.

etc

The docx_summary() function of the officer package lays out the text in a dataframe wonderfully, except that it doesn't retain the paragraph numbering. The result is that I get a dataframe where the text looks like this:

The Agent will provide the attendant resources as set out in Annex 3 bla bla bla

The Agent will ensure that attendant resources are bla bla bla

An indicative Authority resource profile is set out in bla bla bla.

I guess this has to do with how Word defines numbering as part of a Style rather than as plain text, and I can see in the docx_summary() output that the $style_name variable has Headings 1 through 4, matching the numbering hierarchy in the docx. But I can't figure out how to extract the actual numbering and apply it to each paragraph in the dataframe returned by docx_summary().

The output I want is the same docx_summary() dataframe, but with an added numbering column, to look like this:

output_df <- data.frame(content_type = "paragraph", style_name = "heading 2", numbering = "1.17", text = "The Agent will provide the attendant resources as set out in Annex 3 bla bla bla")

> output_df
  content_type style_name numbering text
1    paragraph  heading 2      1.17 The Agent will provide the attendant resources as set out in Annex 3 bla bla bla
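One possible workaround, given only as a rough sketch, is to rebuild the numbers from the heading levels in $style_name instead of reading them out of the docx. It assumes that every numbered paragraph carries a "heading N" style, that each level counts up from 1, and that the counter of a level restarts whenever a higher level advances. The helper add_numbering() and the toy dataframe are hypothetical stand-ins and not part of officer. Custom start values, skipped levels, or numbered paragraphs that are not headings would not be reproduced.

```r
# Toy stand-in for docx_summary() output (the real dataframe has more columns).
doc_df <- data.frame(
  content_type = "paragraph",
  style_name = c("heading 1", "heading 2", "heading 3", "heading 2",
                 "Normal", "heading 1", "heading 2"),
  text = c("Resources", "The Agent will provide ...", "The Agent will ensure ...",
           "An indicative Authority resource profile ...", "Unnumbered note",
           "Payment", "The Authority will pay ..."),
  stringsAsFactors = FALSE
)

# Hypothetical helper: derive "1.2.3"-style labels from "heading N" styles.
# Assumes each level starts at 1 and resets when a higher level advances.
add_numbering <- function(df, max_level = 4) {
  # Identify heading rows and pull out their level; everything else stays NA.
  is_heading <- grepl("^heading [1-9]$", df$style_name, ignore.case = TRUE)
  level <- rep(NA_integer_, nrow(df))
  level[is_heading] <- as.integer(
    sub("^heading ", "", df$style_name[is_heading], ignore.case = TRUE)
  )

  counters <- integer(max_level)            # running counter per heading level
  numbering <- rep(NA_character_, nrow(df))

  for (i in seq_len(nrow(df))) {
    lv <- level[i]
    if (is.na(lv) || lv > max_level) next   # leave non-heading rows as NA
    counters[lv] <- counters[lv] + 1L
    if (lv < max_level) {
      counters[(lv + 1):max_level] <- 0L    # restart all deeper levels
    }
    numbering[i] <- paste(counters[seq_len(lv)], collapse = ".")
  }

  df$numbering <- numbering
  df
}

numbered_df <- add_numbering(doc_df)
print(numbered_df[, c("style_name", "numbering")])
```

On the toy data the headings come out as 1, 1.1, 1.1.1, 1.2, 2 and 2.1, and the "Normal" row gets NA. With a real document the result of docx_summary() would be passed in place of doc_df, and the labels should be spot-checked against the numbers Word displays, since this only infers them from the heading hierarchy.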

Any help would be much appreciated.

The only solution I have found so far is to manually run the following VBA script: https://answers.microsoft.com/en-us/msoffice/forum/all/is-there-a-way-to-remove-bullets-but-retain/06b677be-63a5-4f07-9dc2-6a66c46b7e7f - atsyplenkov, Aug 08 '22 at 18:47
