How can I retain the numbering of paragraphs when extracting text from a docx file?
I'm doing some NLP/ML work on a bunch of docx files, and as a first step I need to break each document up into a dataframe. I'm working with contracts, so almost every paragraph is numbered, e.g. most of the text I'm dealing with looks like this:
1.17. The Agent will provide the attendant resources as set out in Annex 3 bla bla bla
1.17.1. The Agent will ensure that attendant resources are bla bla bla
1.18. An indicative Authority resource profile is set out in bla bla bla.
etc
The docx_summary() function from the officer package lays out the text in a dataframe wonderfully, except that it doesn't retain the paragraph numbering. The result is that I get a dataframe where the text looks like this:
The Agent will provide the attendant resources as set out in Annex 3 bla bla bla
The Agent will ensure that attendant resources are bla bla bla
An indicative Authority resource profile is set out in bla bla bla.
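For reference, this is roughly how the dataframe is produced. It is only a minimal sketch: the document is built on the fly from officer's default template as a stand-in for one of my contracts, so the paragraph text, the heading styles chosen, and the temporary path are placeholders, and the stand-in does not reproduce the automatic numbering of the real files.

```r
library(officer)

# Build a small stand-in document (placeholder for a real contract).
# The default officer template is assumed to provide "heading 1" and "heading 2".
doc <- read_docx()
doc <- body_add_par(doc, "The Agent will provide the attendant resources", style = "heading 1")
doc <- body_add_par(doc, "The Agent will ensure that attendant resources are available", style = "heading 2")

# Write it to a temporary file so it can be read back like any other docx.
path <- tempfile(fileext = ".docx")
print(doc, target = path)

# Read the file and flatten its content into a dataframe.
df <- docx_summary(read_docx(path))

# The columns I care about; note there is no column holding "1.17" etc.
df[, c("content_type", "style_name", "text")]
```

With my real files I just call read_docx() on the contract's path instead of building the document first.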
I guess this is because Word defines numbering as part of a style rather than as plain text. I can see that in the docx_summary() output the $style_name variable takes the values heading 1 through heading 4, following the numbering hierarchy in the docx. But I can't figure out how to extract the actual numbering and apply it to each paragraph in the dataframe that docx_summary() returns.
The output I want is the same docx_summary() dataframe, but with an added numbering column, to look like this:
output_df <- data.frame(content_type = "paragraph", style_name = "heading 2", numbering = "1.17", text = "The Agent will provide the attendant resources as set out in Annex 3 bla bla bla")
> output_df
content_type style_name numbering text
1 paragraph heading 2 1.17 The Agent will provide the attendant resources as set out in Annex 3 bla bla bla
Any help would be much appreciated.