Getting text from doc and docx

Question

I'm using a computer with Windows 7 and Python 3.3 installed. My organization has thousands of unorganized documents. I want to write a program that opens .doc/.docx files, searches their text for certain keywords, and then rearranges the documents accordingly. I'm looking for a way to search the text of a Word file for certain words; it has to run on Windows and handle both .doc and .docx.

Any ideas?

score 0 · Answer 1 · answered Apr 25 '17 at 12:27

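A .docx document is a Zip archive of OpenXML files, so it has to be uncompressed before its text can be read.

The legacy python-docx module does the uncompressing for you when it opens the file, so you can run: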

# Import the helpers from the legacy python-docx API
from docx import opendocx, search

# Open the .docx file
document = opendocx('A document.docx')

# search() returns True if the pattern is found in the document text
found = search(document, 'your search string')
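
If installing a third-party module is a problem, the uncompressing step can also be done in memory with the standard library, so nothing has to be unzipped by hand. The following is a minimal sketch, not part of python-docx: docx_text and contains_keyword are hypothetical helper names, and it assumes the text lives in word/document.xml, which is where Word normally stores the main document part. It reads only the document body (no headers, footers, or footnotes) and does nothing for .doc files.

```python
import re
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by the main document part
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def docx_text(source):
    """Return the body text of a .docx file, one line per paragraph.

    `source` may be a path or a file-like object. The archive is read
    in memory; nothing is extracted to disk.
    """
    with zipfile.ZipFile(source) as archive:
        # Assumption: the main part is stored at word/document.xml
        xml_bytes = archive.read("word/document.xml")
    root = ET.fromstring(xml_bytes)
    paragraphs = []
    for para in root.iter(W_NS + "p"):
        # A paragraph's text is split across one or more <w:t> elements
        runs = [node.text or "" for node in para.iter(W_NS + "t")]
        paragraphs.append("".join(runs))
    return "\n".join(paragraphs)


def contains_keyword(source, keyword):
    """Case-insensitive literal search in the document body (hypothetical helper)."""
    pattern = re.escape(keyword)
    return re.search(pattern, docx_text(source), re.IGNORECASE) is not None
```

Calling contains_keyword('A document.docx', 'your search string') then returns True or False for a single file, and the same call can be made in a loop over a folder of documents.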

answered Apr 25 '17 at 12:27

XzibitGG

I have thousands of documents, I can't uncompress every single one of them, it's not practical. – matan ben simon Apr 25 '17 at 12:40
But it doesn't deal with .doc :-( – matan ben simon Apr 25 '17 at 14:02

score 0 · Answer 2 · answered Apr 25 '17 at 14:16

You can use the textract library. It takes care of both .doc and .docx files.

import textract
# process() picks a parser from the file extension and returns bytes
text = textract.process("path/to/file.extension").decode("utf-8")

You can also use antiword directly (sudo apt-get install antiword). It converts a .doc file to plain text; .docx files can be read with docx2txt.

antiword filename.doc > filename.txt

Under the hood, textract itself uses antiword for .doc files.
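
Once the text can be extracted, the keyword matching and rearranging that the question asks about could be sketched roughly as below. This is only an illustration under stated assumptions: classify, sort_documents, and the (keyword, folder) rule format are hypothetical names, not part of textract, and the extractor is passed in as a function so that any of the approaches above can be plugged in.

```python
import os
import shutil


def classify(text, rules):
    """Return the folder of the first rule whose keyword occurs in text.

    `rules` is a list of (keyword, folder) pairs; matching is
    case-insensitive. Returns None when no keyword matches.
    """
    lowered = text.lower()
    for keyword, folder in rules:
        if keyword.lower() in lowered:
            return folder
    return None


def sort_documents(src_dir, dest_dir, rules, extract):
    """Move each .doc/.docx file in src_dir into dest_dir/<folder>.

    `extract` is any callable mapping a path to a str. Files that match
    no rule, or that cannot be read, are left where they are.
    Returns a dict of {file name: folder} for the files that were moved.
    """
    moved = {}
    for name in sorted(os.listdir(src_dir)):
        if not name.lower().endswith((".doc", ".docx")):
            continue
        path = os.path.join(src_dir, name)
        try:
            text = extract(path)
        except Exception:
            continue  # unreadable file: skip it rather than abort the batch
        folder = classify(text, rules)
        if folder is None:
            continue
        target = os.path.join(dest_dir, folder)
        os.makedirs(target, exist_ok=True)
        shutil.move(path, os.path.join(target, name))
        moved[name] = folder
    return moved
```

With textract installed, the extractor could be lambda p: textract.process(p).decode("utf-8"), and a call might look like sort_documents("inbox", "sorted", [("invoice", "finance")], extractor), where the folder names and the rule are made up for the example.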

answered Apr 25 '17 at 14:16

XzibitGG

The installation runs, but fails at the end... :-( It appears it doesn't work on Python 3.3 :-( – matan ben simon Apr 25 '17 at 18:24
Can you send a screenshot? – XzibitGG Apr 25 '17 at 20:21
When installing from the command prompt with pip install textract, it says "build failed" in red, then prints some more lines in red and closes the window... too fast for me to screenshot XD Is there a way to freeze it? – matan ben simon Apr 25 '17 at 20:23
Great library, but the installation doesn't go through on Python 3.3 – SVK Feb 23 '18 at 20:22
Installation goes through brew. For those interested, you can also convert a .doc file into pdf preserving the images, then extract text with OCR using pytesseract. – linello May 22 '20 at 07:49
