4

I want to read header text from a docx file in Python. I am using python-docx module.

Can someone help me to do this, if this feature has already been implemented.

I tried to do this this way, but didn't get a success.

from docx import Document

document = Document(path)
section = document.sections[0]
print(section.text)

Error:
    <class 'AttributeError'>'Section' object has no attribute 'text'

And:

from docx import Document

document = Document(path)
header = document.sections[0].header
print(header.text)

Error:
    <class 'AttributeError'>'Section' object has no attribute 'header'
Dipas
  • 294
  • 2
  • 9
  • 21
  • What do you mean header? Also please note, that `section` is mostly related to the page properties (height, width, margins etc). In order to fetch some text you have to deal with `paragraphs` check this documentation https://python-docx.readthedocs.io/en/latest/user/text.html – machin Jan 15 '18 at 16:51
  • Also check this answer https://stackoverflow.com/questions/40388763/extracting-headings-text-from-word-doc/40392237#40392237 – machin Jan 15 '18 at 16:55
  • Check out [docx2text](https://github.com/ankushshah89/python-docx2txt). This will read text from headers and footers. – probat Jun 28 '18 at 15:09

2 Answers2

5

At the time you asked your question, this wasn't possible using the python-docx library. In the 0.8.8 release (January 7, 2019), header/footer support was added.

In a Word document, each section has a header. There's a lot of potential wrinkles to headers (e.g. they can be linked from section to section or different on even/odd pages), but in the simple case, with one section and a non-complicated header, you just need to go through the paragraphs in the section header.

from docx import Document
document = Document(path_and_filename)
section = document.sections[0]
header = section.header
for paragraph in header.paragraphs:
    print(paragraph.text) # or whatever you have in mind

I'm working with a document that has the header laid out with a table instead of simple text. In that case, you'd need to work with the rows in header.tables[0] instead of the paragraphs.

Matt Schouten
  • 141
  • 2
  • 3
1

Here's another simple way to get text from the headers and footers of your docx.

import docx2python as docx2python
from docx2python.iterators import iter_paragraphs

doc = docx2python('file.docx')
header_text = '\n\n'.join(iter_paragraphs(doc.header))
footer_text = '\n\n'.join(iter_paragraphs(doc.footer))

docx2python will extract any images in your headers and footers. In your header and footer text, these will be replaced with ----image1.png----.

Shay
  • 1,368
  • 11
  • 17
  • 4
    While this code may answer the question, providing additional context regarding why and/or how this code answers the question improves its long-term value. – Nic3500 Jul 10 '19 at 23:05