Every day I receive an email with a Word document. All of the text in the document lives in tables. I have hundreds of these documents (one arrives every day). I want to use Python to open each document, copy the text I need, and paste it into an Excel workbook. However, I am stuck on the very first part: I can't pull the text out of the Word document. I am trying to use the python-docx module, but I can't figure out how to read the text from the tables.

I modified a getText function from the Python intro book I am reading, but it doesn't seem to work. Am I even on the right track here?

import docx
fullText = []

def getText(filename):
    doc = docx.Document(filename)
    for table in doc.Tables:
        for row in table.Rows:
            for cell in row.Cells:
                  fullText.append(cell.text)
    return '\n'.join(fullText)

Okay, after looking at this other question, I have realized that I actually have a different problem than I thought. I made changes and now have the following code:

import docx
fullText = []

doc = docx.Document('c:\\btest\\January18.docx')
for table in doc.tables:
    for row in table.rows:
        for cell in row.cells:
            fullText.append(cell.text)
'\n'.join(fullText)

print(fullText)

It prints this:

['', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '']

The thing is, the table cells in the Word document are not blank, so they should not come back as empty strings. What am I doing wrong?

A sample input document is here

I am trying to pull certain rows of text out of this document, then paste and format the text the way I want. However, I can't even access the text in the Word document...
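One way to check whether the cell text is really present in the file, independent of python-docx, is to read the underlying XML directly. The sketch below relies only on the fact that a .docx file is a zip archive containing `word/document.xml`; the function name `table_cell_texts` is made up for illustration. It collects every text run below each table cell, so it also sees text that sits inside nested tables or content controls. Whether that is what hides the text in this particular document is an assumption, not something verified against the sample file.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by the elements in word/document.xml.
W = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def table_cell_texts(docx_file):
    """Return the text of every table cell found in word/document.xml.

    docx_file may be a path or a binary file-like object.
    Hypothetical helper, not part of python-docx.
    """
    with zipfile.ZipFile(docx_file) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))

    texts = []
    for cell in root.iter(W + "tc"):
        # Join every text run below the cell, however deeply it is nested.
        # With nested tables, an outer cell therefore repeats the text of
        # its inner cells, which are also listed on their own.
        texts.append("".join(run.text or "" for run in cell.iter(W + "t")))
    return texts
```

If this returns non-empty strings for a file where `cell.text` is empty, the text is most likely wrapped in a structure that sits between the cell and its paragraphs.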

Neurofro
  • "Doesn't seem to work" - how do you know? Do you get *anything* at all? `len(doc.Tables)` for example. – Jongware Dec 25 '18 at 23:44
  • I get `AttributeError: 'Document' object has no attribute 'Tables'`. I get this with the code I shared in the OP, and with the `len(doc.Tables)` line as well. – Neurofro Dec 26 '18 at 00:28
  • Possible duplicate of [python -docx to extract table from word docx](https://stackoverflow.com/questions/46618718/python-docx-to-extract-table-from-word-docx) – Jongware Dec 26 '18 at 01:00
  • Post a sample input that reproduces the problem with your code. Generate a sample document, don't post anything proprietary – Mad Physicist Dec 26 '18 at 01:43
  • @MadPhysicist I have posted a link to the sample document above. Thank you. – Neurofro Dec 26 '18 at 03:37
  • Just to clarify, does that document contain information we shouldn't be seeing? It looks like a lot of personal information. – Mad Physicist Dec 26 '18 at 04:40
  • All of the information is public. The document is sent out to the public; nobody knows how to make use of it, and that is what I am trying to do. – Neurofro Dec 26 '18 at 14:45

1 Answer

I was able to parse the sample document and save it to an Excel file with the following script:

import re

import docx2txt
import pandas

INPUT_FILE = 'jantest2.docx'
OUTPUT_FILE = 'jantest2.xlsx'

# Extract the document's text as one plain string.
text = docx2txt.process(INPUT_FILE)

# One match per record: a case number, then the next three fields.
# The fields are separated by blank lines, and any of them may be empty.
results = re.findall(r'(\d+-\d+)\n\n(.*)\n\n(.*)\n\n(.*)', text)
data = {'Case Number': [x[0] for x in results],
        'Report Date': [x[1] for x in results],
        'Address': [x[2] for x in results],
        'Statute Description': [x[3] for x in results]}

data_frame = pandas.DataFrame(data=data)

# The context manager saves and closes the workbook on exit.
with pandas.ExcelWriter(OUTPUT_FILE) as writer:
    data_frame.to_excel(writer, sheet_name='Sheet1', index=False)

Here is what I got in the Excel file:

(screenshot of the resulting spreadsheet)

Alderven
  • That is great. However, when I try to do it for the full file, not the sample, I get: – Neurofro Dec 26 '18 at 17:47
  • Traceback (most recent call last): File "C:/Monty/testingbrev.py", line 16, in data_frame = pandas.DataFrame(data=data) File "C:\Monty\lib\site-packages\pandas\core\frame.py", line 348, in __init__ mgr = self._init_dict(data, index, columns, dtype=dtype) File "C:\Monty\lib\site-packages\pandas\core\frame.py", line 459, in _init_dict return _arrays_to_mgr(arrays, data_names, index, columns, dtype=dtype) File "C:\Monty\lib\site-packages\pandas\core\frame.py", line 7356, in _arrays_to_mgr index = extract_index(arrays) – Neurofro Dec 26 '18 at 17:48
  • File "C:\Monty\lib\site-packages\pandas\core\frame.py", line 7402, in extract_index raise ValueError('arrays must all be same length') ValueError: arrays must all be same length – Neurofro Dec 26 '18 at 17:48
  • Is it possible for you to find out which part of the data produces that error and send me this piece of data? – Alderven Dec 26 '18 at 18:32
  • I can try but, it might be a while, the full document is hundreds of pages long. I will see what I can do. – Neurofro Dec 27 '18 at 03:23
  • I have narrowed the problem down to a single character that appears in some of the addresses. Some of the addresses are intersections and so they are stored as "EAST / WEST" for example. If I take out the forward slash, the error goes away. I added a / to the regex and that first problem went away. – Neurofro Dec 27 '18 at 04:22
  • I keep running into the same error though, in different parts. Ultimately the problem looks like the regexes aren't always catching every occurrence, so the arrays are not the same size for each of the keys. – Neurofro Dec 27 '18 at 04:22
  • I have done some more digging and troubleshooting on the regex problems. I have identified every variation that may occur and, where possible, modified the regex to fit. I am down to two cases I cannot figure out how to account for, both of which happen to be in the address regex. Sometimes no address is supplied, which causes an issue for obvious reasons. Other times there is a hyphen in the address. This could be an easy fix, but when I add a hyphen to the regex, it starts picking up the case numbers as well. Any ideas on how to account for those? – Neurofro Dec 27 '18 at 05:08
  • I've made the script more robust. Try it out against your data. – Alderven Dec 27 '18 at 05:28
  • It works perfectly. I see what you did. You set up the regex to have four groups. The first group follows the format of the case number (four digits, a hyphen, more digits). Then it grabs all of the text in each of the next three columns and puts that text into groups (if there is no text, the group is just blank). I was starting to think in that direction last night, and you did it perfectly. Thank you so much! – Neurofro Dec 27 '18 at 14:31
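The behavior described in the last comment can be sketched on a small made-up string. The pattern is the one from the answer; the sample records below are invented for illustration, and the real docx2txt output may differ in its details.

```python
import re

# Same pattern as in the answer: a case number, then three fields separated
# by blank lines. `.` never matches a newline, so an absent field becomes ''.
RECORD = re.compile(r'(\d+-\d+)\n\n(.*)\n\n(.*)\n\n(.*)')

# Made-up sample text, one field per entry, joined by blank lines.
sample_text = '\n\n'.join([
    '2018-0001', '01/18/2018', '', 'THEFT',                 # no address
    '2018-0002', '01/18/2018', 'EAST / WEST', 'ASSAULT',    # intersection
    '2018-0003', '01/19/2018', '100-200 MAIN ST', 'FRAUD',  # hyphenated address
])

# findall returns one 4-tuple per record, in document order.
records = RECORD.findall(sample_text)
for case_number, report_date, address, statute in records:
    print(repr(case_number), repr(report_date), repr(address), repr(statute))
```

Because every record yields exactly one four-element tuple, the four column lists built from the matches always have the same length, which is presumably why the "arrays must all be same length" error no longer occurs. The hyphenated address is consumed as the third group of its own record, so it is not mistaken for a new case number.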