How to extract text data in a table created in a docx document

Question

I would like to extract text from docx documents. I wrote a script that does this, but I noticed that some documents contain tables and the script does not extract their text. How can I improve the script below?


import glob
import os

import docx

with open('your_file.txt', 'w', encoding='utf-8') as f:
    for directory in glob.glob('fi*'):
        for filename in glob.glob(os.path.join(directory, "*")):
            # python-docx can only open .docx files, not legacy .doc files
            if filename.endswith(".docx"):
                document = docx.Document(filename)
                # document.paragraphs covers body paragraphs only, not table cells
                for paragraph in document.paragraphs:
                    if paragraph.text:
                        f.write("%s\n" % paragraph.text)

docx with table

Does this answer your question? [python -docx to extract table from word docx](https://stackoverflow.com/questions/46618718/python-docx-to-extract-table-from-word-docx) — Jongware, Jan 29 '20 at 15:58
What does _the script do not work on them_ mean? What part are you struggling with, exactly? Stack Overflow is not a provider of free software solutions. — AMC, Jan 29 '20 at 18:29

solarflare · Accepted Answer · 2020-01-29T16:11:51.910

4

Try the python-docx module, which exposes a document's tables through doc.tables:

pip install python-docx

import docx

doc = docx.Document("document.docx")

for table in doc.tables:
    for row in table.rows:
        for cell in row.cells:
            print(cell.text)
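
To fold this into the loop from the question, something along the following lines might work. It is only a sketch: the helper names document_lines and export_folder are made up for illustration, and it writes all body paragraphs first and all tables afterwards, so the original order of paragraphs and tables in the document is not preserved.

```python
import glob
import os


def document_lines(document):
    """Yield body paragraph text, then one tab-separated line per table row."""
    for paragraph in document.paragraphs:
        if paragraph.text:
            yield paragraph.text
    for table in document.tables:
        for row in table.rows:
            cells = [cell.text.strip() for cell in row.cells]
            # Skip rows in which every cell is empty
            if any(cells):
                yield "\t".join(cells)


def export_folder(pattern, out_path):
    """Write the text of every .docx file in the matching directories to out_path."""
    import docx  # python-docx; imported here so document_lines has no hard dependency

    with open(out_path, "w", encoding="utf-8") as f:
        for directory in glob.glob(pattern):
            for filename in glob.glob(os.path.join(directory, "*.docx")):
                for line in document_lines(docx.Document(filename)):
                    f.write(line + "\n")
```

Calling export_folder("fi*", "your_file.txt") would then mirror the directory pattern and output file used in the question.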

edited Jan 29 '20 at 16:11

answered Jan 29 '20 at 16:06


Thank you, it seems to work. I will use it and adapt it to the new documents I need to preprocess. – kely789456123 Jan 30 '20 at 15:54
