
I want to search a docx document for text matching a specific regular expression. I installed python-docx and I can find plain strings in my text, but I want to use regular expressions instead.

So far my code is:

import re
from docx import Document
doc = Document('categoriemanzoni.docx')
match = re.search(r"\[(['prima']+(?!\S))", doc)

for paragraph in doc.paragraphs:
    paragraph_text = paragraph.text
    if match in paragraph.text:
        print('ok')

It also seems to me that the code doesn't read all the paragraphs. How can I fix this?

Anna
  • I'm almost sure that your regex is not doing what you want it to. Do you mean to match a literal `[`, followed by one or more of the characters `a`, `i`, `m`, `p`, `r` or `'`? – Tim Pietzcker Mar 14 '20 at 11:40
  • I want to find [ and "prima" and " " (whitespace) like this: [prima ; all together – Anna Mar 14 '20 at 11:41
  • Your use of `re` itself is also pretty much mangled beyond recognition. You cannot use `re.search` on a `Document` type (and it is in fact surprising you don't get an error on that). You probably meant something like [`re.compile`](https://docs.python.org/3/library/re.html?highlight=re.compile#re.compile) there at the top. But still: `match in xxx` is *not* how that works. – Jongware Mar 14 '20 at 11:49
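
To illustrate the point raised in these comments, here is a minimal sketch comparing the pattern from the question with the simpler `\[prima(?!\S)`. The sample strings are invented for illustration and do not come from the asker's document.

```python
import re

# Pattern from the question: "[" followed by one or more characters taken
# from the set {', p, r, i, m, a}, not followed by a non-whitespace character.
original = re.compile(r"\[(['prima']+(?!\S))")

# Simpler pattern: the literal text "[prima", not followed by a
# non-whitespace character (so followed by whitespace or end of string).
fixed = re.compile(r"\[prima(?!\S)")

# Hypothetical sample strings, made up for this demonstration.
samples = ["[prima categoria", "[primavera", "[prima", "[mappa di", "[prima]"]

original_hits = {s: bool(original.search(s)) for s in samples}
fixed_hits = {s: bool(fixed.search(s)) for s in samples}

for s in samples:
    print(f"{s!r}: original={original_hits[s]}, fixed={fixed_hits[s]}")
```

The character class in the original pattern accepts any mix of those letters, so it also matches text such as `[mappa di`, which is probably not intended.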

2 Answers


Your code is applying the regex (which itself is faulty) at the wrong place. You probably want something like this:

import re
from docx import Document
doc = Document('categoriemanzoni.docx')
regex = re.compile(r"\[prima(?!\S)")

for paragraph in doc.paragraphs:
    if regex.search(paragraph.text):
        print('ok')
Tim Pietzcker
  • yeah! but if I run this code it doesn't find anything, almost like it doesn't read the whole document... – Anna Mar 14 '20 at 11:49
  • @Anna: probably because your regex is bad. Try with a simpler string and you will see it works. – Jongware Mar 14 '20 at 12:16
  • I haven't used the docx module, so there may be something wrong with this approach. How about doing a `print(paragraph.text)` for each paragraph and see if it really contains what you think it contains. – Tim Pietzcker Mar 14 '20 at 12:40
  • True, my error was that there were tables and it didn't read the text in them – Anna Mar 14 '20 at 13:09
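
Since the missing text turned out to be inside tables, here is a hedged sketch of one way to search table cells as well. In python-docx, `doc.paragraphs` only returns body paragraphs; tables are reached through `doc.tables`, then `table.rows`, `row.cells`, and each cell's own `paragraphs` and `tables`. The helper names `iter_paragraph_texts` and `find_matches` are hypothetical and were not part of the original answer.

```python
import re


def iter_paragraph_texts(container):
    """Yield the text of every paragraph in `container`, including tables.

    `container` is assumed to expose `.paragraphs` and `.tables`, as a
    python-docx Document or table cell does. Body paragraphs are yielded
    first and table contents afterwards, so document order is not preserved.
    Merged cells may be visited more than once.
    """
    for paragraph in container.paragraphs:
        yield paragraph.text
    for table in container.tables:
        for row in table.rows:
            for cell in row.cells:
                # A cell can hold paragraphs and nested tables, so recurse.
                yield from iter_paragraph_texts(cell)


def find_matches(container, pattern):
    """Return the paragraph texts in which `pattern` is found."""
    regex = re.compile(pattern)
    return [text for text in iter_paragraph_texts(container) if regex.search(text)]
```

With python-docx installed, this could be called as `find_matches(Document('categoriemanzoni.docx'), r"\[prima(?!\S)")`, using the file name from the question.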
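import re
import docx2txt
# Extract the document's text as a single string
test_doc = docx2txt.process('story.docx')
# Example pattern: three digits, three digits, four digits, separated by hyphens
docu_Regex = re.compile(r'\d\d\d-\d\d\d-\d\d\d\d')
mo = docu_Regex.findall(test_doc)
print(mo)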

I used this as an example. It worked the way I needed it to.

4b0
  • I doubt that this helps, or even works at all. To convince me otherwise please explain the code. How does it work and why is that supposed to help? – Yunnosch Aug 31 '20 at 08:29