How to use regular expressions with python docx?

Question

I want to find a specific regex in a docx document. I installed python-docx and I can find strings in my text. However, I want to use regular expressions.

So far my code is:

import re
from docx import Document
doc = Document('categoriemanzoni.docx')
match = re.search(r"\[(['prima']+(?!\S))", doc)

for paragraph in doc.paragraphs:
    paragraph_text = paragraph.text
    if match in paragraph.text:
        print('ok')

It also seems that the code doesn't read all paragraphs. How can I fix this?

I'm almost sure that your regex is not doing what you want it to. Do you mean to match a literal `[`, followed by one or more of the characters `a`, `i`, `m`, `p`, `r` or `'`? – Tim Pietzcker, Mar 14 '20 at 11:40
I want to find [ followed by "prima" and then whitespace, all together, like this: "[prima " – Anna, Mar 14 '20 at 11:41
Your use of `re` itself is also pretty much mangled beyond recognition. You cannot use `re.search` on a `Document` type (and it is in fact surprising you don't get an error on that). You probably meant something like [`re.compile`](https://docs.python.org/3/library/re.html?highlight=re.compile#re.compile) there at the top. But still: `match in xxx` is *not* how that works. – Jongware, Mar 14 '20 at 11:49

score 2 · Answer 1 · answered Mar 14 '20 at 11:48

Your code is applying the regex (which itself is faulty) at the wrong place. You probably want something like this:

import re
from docx import Document
doc = Document('categoriemanzoni.docx')
regex = re.compile(r"\[prima(?!\S)")

for paragraph in doc.paragraphs:
    if regex.search(paragraph.text):
        print('ok')
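
To check what the pattern above accepts before running it on a document, it can be tried on a few plain strings first. This is a minimal sketch: the sample strings are hypothetical and not taken from the asker's file. Note that `(?!\S)` also succeeds at the very end of the text, not only before whitespace.

```python
import re

# Pattern from the answer above: a literal "[", the word "prima",
# then a negative lookahead requiring that no non-whitespace character follows.
regex = re.compile(r"\[prima(?!\S)")

# Hypothetical sample strings mapped to whether a match is expected.
samples = {
    "[prima categoria]": True,   # "[prima" followed by a space
    "testo [prima": True,        # "[prima" at the very end of the text
    "[primavera]": False,        # "prima" followed by a letter
    "[prima]": False,            # "prima" followed by "]"
    "prima categoria": False,    # no opening bracket
}

for text, expected in samples.items():
    found = regex.search(text) is not None
    print(f"{text!r}: found={found}, expected={expected}")
```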

answered Mar 14 '20 at 11:48 by Tim Pietzcker

yeah! but if I run this code it doesn't find anything, almost like it doesn't read the whole document... – Anna Mar 14 '20 at 11:49
@Anna: probably because your regex is bad. Try with a simpler string and you will see it works. – Jongware Mar 14 '20 at 12:16
I haven't used the docx module, so there may be something wrong with this approach. How about doing a `print(paragraph.text)` for each paragraph and see if it really contains what you think it contains. – Tim Pietzcker Mar 14 '20 at 12:40
True, my error was that there were tables and it didn't read the text in them – Anna Mar 14 '20 at 13:09
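
Following up on the last comment: `doc.paragraphs` only returns body paragraphs, so text inside tables has to be walked separately. The sketch below is one possible way to do that, assuming python-docx's `paragraphs` and `tables` attributes on both the document and each table cell. The helper names `iter_texts` and `find_matches` are made up for illustration.

```python
import re

PATTERN = re.compile(r"\[prima(?!\S)")


def iter_texts(container):
    """Yield paragraph text from a document or cell, then from its tables.

    `container` is anything with `.paragraphs` and `.tables`, such as a
    python-docx Document or table cell. Nested tables are handled by recursion.
    """
    for paragraph in container.paragraphs:
        yield paragraph.text
    for table in container.tables:
        for row in table.rows:
            for cell in row.cells:
                yield from iter_texts(cell)


def find_matches(doc, pattern=PATTERN):
    """Return the text of every paragraph (body or table) that matches."""
    return [text for text in iter_texts(doc) if pattern.search(text)]


# Usage with python-docx (assumes the file from the question exists):
#   from docx import Document
#   doc = Document("categoriemanzoni.docx")
#   for text in find_matches(doc):
#       print("ok:", text)
```

One caveat: merged cells may be reported more than once when iterating over `row.cells`, so duplicate hits are possible in tables with merged cells.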

score 0 · Answer 2 · edited Aug 31 '20 at 10:36

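import re
import docx2txt

# Extract the whole document as one plain-text string
test_doc = docx2txt.process('story.docx')
# Match numbers shaped like 415-555-1011
docu_Regex = re.compile(r'\d\d\d-\d\d\d-\d\d\d\d')
mo = docu_Regex.findall(test_doc)  # list of every match
print(mo)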

I used this as an example. It worked the way I needed it to.
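
To show what the regex step does on its own, here is a small sketch that skips the file and applies the same pattern to a plain string. The string is a hypothetical stand-in for what `docx2txt.process` returns (the document's text as a single string), and the phone numbers are invented.

```python
import re

# Same pattern as above, written with repetition counts:
# three digits, three digits, four digits, separated by hyphens.
docu_regex = re.compile(r"\d{3}-\d{3}-\d{4}")

# Hypothetical stand-in for the string returned by docx2txt.process(...).
test_doc = "Call 415-555-1011 today.\n\nOffice: 212-555-0000 (weekdays)"

# findall returns every non-overlapping match as a list of strings.
print(docu_regex.findall(test_doc))  # ['415-555-1011', '212-555-0000']
```

Because the whole document is searched as one string, this approach finds matches but does not report which paragraph or table cell they came from.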

edited Aug 31 '20 at 10:36 by 4b0

answered Aug 31 '20 at 08:10 by Daniel A. Morales

I doubt that this helps, or even works at all. To convince me otherwise, please explain the code. How does it work and why is that supposed to help? – Yunnosch Aug 31 '20 at 08:29
