Extraction of text page by page from MS Word docx file using Python

Question

I have an MS Word docx file and I need to extract its text page by page. I tried python-docx, but it extracts the whole text, not page-wise. I also converted the docx to PDF and then tried text extraction. The problem is that the page structure changed during conversion: for example, the font size changed, and the content of one docx page took up more than one page in the PDF.

I am looking for a stable solution that extracts text page-wise from a docx file (doing it without converting to PDF would suit my overall solution better). Can somebody help me with this?

score 6 · Answer 1 · answered Dec 18 '19 at 07:18

It seems to me that the docx format (and therefore also the python-docx library) only supports paragraphs and sections, not pages.

Microsoft Word does not support the concept of hard pages. Instead, when the exported document is opened in Word, Word repaginates it again based on the page size. (source)

So in fact the pagination is not stored in the docx file, but rather carried out by the rendering engine:

DOCX files contain no information about pagination. You won’t find the number of pages in the document unless you calculate how much space you need for each line to ascertain the number of pages. (source)

This page has some more background and recommends using PDF if pagination must be kept.
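
One nuance may be worth illustrating. Although Word paginates at render time, the XML inside a docx can still carry two kinds of page hints: explicit page breaks (w:br with w:type="page") and w:lastRenderedPageBreak markers that Word caches from its last layout. Below is a minimal sketch, using only the standard library, that splits the body text at those markers. The function name pages_from_break_markers is made up for illustration, and the cached markers can be missing or stale (for example in files not last saved by Word), so the result is a heuristic and not true pagination.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by word/document.xml
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def pages_from_break_markers(docx_file):
    """Split body text at page-break hints (hypothetical helper, heuristic only).

    docx_file may be a path or a binary file-like object.
    """
    with zipfile.ZipFile(docx_file) as zf:
        root = ET.fromstring(zf.read("word/document.xml"))

    pages, current = [], []
    body = root.find(f"{{{W}}}body")
    for para in body.iter(f"{{{W}}}p"):
        line = []
        for node in para.iter():
            is_explicit_break = (
                node.tag == f"{{{W}}}br" and node.get(f"{{{W}}}type") == "page"
            )
            is_cached_break = node.tag == f"{{{W}}}lastRenderedPageBreak"
            if node.tag == f"{{{W}}}t":
                line.append(node.text or "")
            elif is_explicit_break or is_cached_break:
                # Close the current page; text after the marker starts a new one
                current.append("".join(line))
                pages.append("\n".join(current).strip())
                current, line = [], []
        current.append("".join(line))
    pages.append("\n".join(current).strip())
    return pages
```

Calling pages_from_break_markers("page-wise-file.docx") would return a list of strings, one per detected page. Headers, footers, and text boxes are not handled by this sketch.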

score 3 · Answer 2 · answered Apr 18 '21 at 05:11

I faced a similar scenario recently. The following using docx2python worked for me:

from docx2python import docx2python

doc_result = docx2python('page-wise-file.docx')
# docx2python nests body text as table / row / cell / paragraph
paragraphs = doc_result.body[0][0][0]
count = 0
para = 0
pages = []
while para < len(paragraphs):
    if paragraphs[para] != "":
        current_page = {}
        current_page_paras = []
        count += 1
        # Check the bound first so the list is never indexed past its end
        while para < len(paragraphs) and paragraphs[para] != "":
            current_page_paras.append(paragraphs[para])
            para += 1
        current_page["page_text"] = "\n".join(current_page_paras)
        current_page["page_no"] = count
        pages.append(current_page)
    else:
        # Empty paragraphs act as separators between pages
        para += 1

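This loses any formatting information and other metadata from the text, but if extracting text is the only aim then it should work. Note that the loop treats every run of non-empty paragraphs as one page, so it only gives real pages when the pages in the document are separated by empty paragraphs.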
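
The grouping step in the loop above can be sketched more compactly on a plain list of paragraph strings, without docx2python. This is only an illustration: the function name group_into_pages is hypothetical, and it assumes (as the loop does) that empty strings mark the page boundaries.

```python
from itertools import groupby


def group_into_pages(paragraphs):
    """Group consecutive non-empty paragraphs into numbered 'pages'."""
    pages = []
    # groupby yields alternating runs of non-empty and empty paragraphs
    for has_text, run in groupby(paragraphs, key=lambda p: p != ""):
        if has_text:
            pages.append({
                "page_no": len(pages) + 1,
                "page_text": "\n".join(run),
            })
    return pages
```

For example, group_into_pages(doc_result.body[0][0][0]) should build the same list of dictionaries as the while loop.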

As Gerd mentioned, converting the file to PDF and then processing it can also help since libraries like PyPDF2 allow you to read individual pages, for example:

# PyPDF2 2.x/3.x API; older releases used PdfFileReader, getPage and extractText
from PyPDF2 import PdfReader

reader = PdfReader("page-wise-file.pdf")
page = reader.pages[0]  # first page
print(page.extract_text())

score 0 · Accepted Answer · answered Jan 09 '20 at 06:49

I found that the Tika library has an xmlContent option when reading a file. I used it to get the content as XML and then split that into pages with string operations. Below is the Python code that worked for me.

from tika import parser

def extract_pages(file):
    # Ask Tika for XHTML output instead of plain text
    raw_xml = parser.from_file(file, xmlContent=True)
    body = raw_xml['content'].split('<body>')[1].split('</body>')[0]
    body_without_tag = body.replace("<p>", "").replace("</p>", "").replace("<div>", "").replace("</div>", "").replace("<p />", "")
    # Each page is wrapped in <div class="page">
    text_pages = body_without_tag.split("""<div class="page">""")[1:]
    num_pages = len(text_pages)
    # Check that the split matches the page count reported in the metadata
    if num_pages == int(raw_xml['metadata']['xmpTPg:NPages']):
        return text_pages

This solution works well for PDF, not for docx. – Stefano Fiorucci - anakin87 Mar 29 '22 at 08:08
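
As a possible alternative to the chained replace calls, the page split can be sketched with the standard library's XML parser. The sketch below assumes the XHTML is well-formed and wraps each page in a div with class "page", as in the answer's code. The function name pages_from_xhtml and the sample string in the usage note are made up for illustration. In line with the comment above, the page divs may simply be absent for docx input.

```python
import xml.etree.ElementTree as ET


def pages_from_xhtml(xhtml):
    """Return the text of every <div class="page"> in an XHTML string."""
    root = ET.fromstring(xhtml)
    pages = []
    for elem in root.iter():
        # Tags are namespace-qualified, e.g. "{http://www.w3.org/1999/xhtml}div"
        local_name = elem.tag.rsplit("}", 1)[-1]
        if local_name == "div" and elem.get("class") == "page":
            # Keep one line per non-empty text fragment inside the page
            fragments = (t.strip() for t in elem.itertext())
            pages.append("\n".join(f for f in fragments if f))
    return pages
```

With the answer's code, this would be called as pages_from_xhtml(raw_xml['content']). On a hand-written sample with two page divs it returns a list of two strings.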

score -2 · Answer 4 · answered Apr 02 '22 at 07:43

import os
import win32com.client
import pdfplumber

wdFormatPDF = 17  # Word's constant for the PDF save format
in_file = os.path.abspath(Filepath)  # Filepath: path to the input .docx, set beforehand
out_file = os.path.abspath("out.pdf")

# Let Word itself render the document to PDF (needs Windows with Word installed)
word = win32com.client.Dispatch('Word.Application')
doc = word.Documents.Open(in_file)
doc.SaveAs(out_file, FileFormat=wdFormatPDF)
doc.Close()
word.Quit()

# Read the PDF back page by page
with pdfplumber.open(out_file) as pdf:
    for page in pdf.pages:
        out = page.extract_text()
        print(out)

As far as I know, saving to PDF with win32com gives a 1:1 copy of the document, because Word itself does the rendering, so the pagination should match the docx.

score -4 · Answer 5 · answered Dec 18 '19 at 05:23

Try this:


from docx import Document

document = Document('anydocument.docx')
for para in document.paragraphs:
    print(para.text)

Debi


I tried this also and it will give all paragraphs in the whole document, but not page wise. I am trying to get text page by page – AlfiyaFaisy Dec 18 '19 at 06:15
