
How can I search a Word document using Python and extract the paragraph text that follows a matching paragraph heading, e.g. "1.2 Summary of Broadspectrum Offer"?

For example, given the document excerpt below, I would like to get the following text: "A summary of our Offer to deliver the Scope of Work as outlined in the tender documents is provided below. Please refer to the various terms and conditions of our Offer as detailed herein. Please also find the cost breakdown"

1.  Executive Summary

1.1 Summary of Services
Energy Savings (Carbon Emissions and Intensity Reduction)
Upgrade Economy Cycle on Level 2,5,6,7 & 8, replace Chilled Water Valves on Level 6 & 8 and install lighting controls on L5 & 6..

1.2 Summary of Broadspectrum Offer

A summary of our Offer to deliver the Scope of Work as outlined in the tender documents is provided below. Please refer to the various terms and conditions of our Offer as detailed herein.
Please also find the cost breakdown 

Note that the heading numbers change from document to document, so I do not want to rely on them; I want to match on the text of the heading instead.

So far I can search the document, but this is just a start:

filename1 = "North Sydney TE SP30062590-1 HVAC - Project Offer -  Rev1.docx"

from docx import Document

document = Document(filename1)
for paragraph in document.paragraphs:
    if 'Summary' in paragraph.text:
        print(paragraph.text)
Ossama
  • Will your document ever have anything after the `1.2 Summary ...` paragraph? And will `Summary of Broadspectrum Offer` always be labeled with `1.2`? – sadmicrowave Oct 05 '17 at 12:30
  • You should use the re library to write regular expressions. There is extensive info about it around SO and the web. – Anton vBR Oct 05 '17 at 12:55
  • Maybe this can help: https://stackoverflow.com/questions/40388763/extracting-headings-text-from-word-doc – Anton vBR Oct 05 '17 at 12:57

1 Answer

Here's a preliminary solution (pending answers to my comments on your question above). It does not yet exclude any paragraphs that come after the Summary of Broadspectrum Offer section. If that is needed, you will most likely need a small regex match to detect when you have reached the next header section (1.3, etc.) and stop printing there. Let me know if this is a requirement.

Edit: converted the print() from list comprehension method to standard for loop, in response to Anton vBR's comment below.

from docx import Document

document = Document("North Sydney TE SP30062590-1 HVAC - Project Offer -  Rev1.docx")

# Find the index of every paragraph containing `Summary of Broadspectrum Offer`
ind = [i for i, para in enumerate(document.paragraphs) if 'Summary of Broadspectrum Offer' in para.text]
# Print the text of every paragraph after the first match
if ind:
    for para in document.paragraphs[ind[0] + 1:]:
        print(para.text)

>> A summary of our Offer to deliver the Scope of Work as outlined in the tender documents is provided below. 
Please refer to the various terms and conditions of our Offer as detailed herein.
Please also find the cost breakdown 
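
If you do need to stop at the next section, here is a minimal sketch of the regex idea mentioned at the top of this answer. The function name `extract_section` and the pattern `HEADING_RE` are my own, not part of python-docx. It assumes the heading numbers are typed into the paragraph text (not generated by Word's automatic numbering) and look like "1.3 Something" or "2. Something"; a body paragraph that starts the same way (e.g. "3.5 kW units ...") would be mistaken for a heading.

```python
import re

# Assumed heading shape: a dotted number followed by text, e.g. "1.3 Pricing" or "2. Scope".
HEADING_RE = re.compile(r"^\d+\.(?:\d+\.?)*\s+\S")


def extract_section(paragraphs, heading_text):
    """Return the non-empty paragraph texts that follow the heading containing heading_text.

    `paragraphs` is a list of plain strings, one per paragraph.
    """
    section = []
    inside = False
    for text in paragraphs:
        stripped = text.strip()
        if not inside:
            # Match on the heading wording only, so the number can change between documents.
            inside = heading_text in stripped
            continue
        if HEADING_RE.match(stripped):
            break  # reached the next numbered heading, so the section is over
        if stripped:
            section.append(stripped)
    return section
```

With python-docx you would feed it the paragraph texts, e.g. `extract_section([p.text for p in document.paragraphs], 'Summary of Broadspectrum Offer')`.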

Also, here is another post that may help with a different approach, which is to detect headings using the paragraph style metadata: Extracting headings' text from word doc
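
A rough sketch of that style-based approach, assuming the headings in your document use Word's built-in "Heading 1", "Heading 2", ... styles (if the author only made the text bold, this will not find them). The function name `text_under_heading` is hypothetical. The demo uses stand-in objects so it runs without a file; with python-docx you would pass `document.paragraphs` instead, since each paragraph exposes `.text` and `.style.name`.

```python
from types import SimpleNamespace


def text_under_heading(paragraphs, heading_text):
    """Collect non-empty paragraph texts between the matching heading and the next heading.

    Each item in `paragraphs` needs a `.text` attribute and a `.style.name` attribute.
    """
    collected = []
    inside = False
    for para in paragraphs:
        # Assumption: headings carry a style whose name starts with "Heading".
        is_heading = (para.style.name or "").startswith("Heading")
        if inside:
            if is_heading:
                break  # the next heading ends the section
            if para.text.strip():
                collected.append(para.text.strip())
        elif is_heading and heading_text in para.text:
            inside = True
    return collected


def fake_para(text, style_name="Normal"):
    """Stand-in for a python-docx paragraph, for illustration only."""
    return SimpleNamespace(text=text, style=SimpleNamespace(name=style_name))


demo = [
    fake_para("1.1 Summary of Services", "Heading 2"),
    fake_para("Energy Savings (Carbon Emissions and Intensity Reduction)"),
    fake_para("1.2 Summary of Broadspectrum Offer", "Heading 2"),
    fake_para(""),
    fake_para("A summary of our Offer."),
    fake_para("1.3 Pricing", "Heading 2"),
]
result = text_under_heading(demo, "Summary of Broadspectrum Offer")
print(result)
```

This avoids any dependence on the heading number, at the cost of depending on the document being styled consistently.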

sadmicrowave
  • The rows you posted are long and using print() inside a list comprehension isn't really recommended. – Anton vBR Oct 05 '17 at 12:53
  • I removed the list comprehension with the `print()` function. It was simply a clean one liner to print what was desired, but you are right, it isn't the best practice. – sadmicrowave Oct 05 '17 at 12:59