Docx Python - Read line by line

Question

I have the following text in a Word file. I am trying to read the text line by line and check whether the last word of each line is hyphenated; if it is, I want to join that word with the first word of the next line, without the hyphen.

*The protection of anticipatory or pre-arrest bail cannot be limited to any time-

frame or “fixed period” as denial of bail amounts to deprivation of the funda-

mental right to personal liberty in a free and democratic country, a Consti-

tution Bench of the Supreme Court ruled on Wednesday.*

Expected output: timeframe fundamental constitution

python-docx only offers a way to read an entire paragraph, not individual lines.

Is there a way to do this in Python? Can someone assist?

vkSinha · Answer 1 · 2020-01-30T06:27:42.020

Convert your paragraph into text and then split it on '\n':

from docx import Document

# To read an existing file instead: d = Document('f.docx')
d = Document()
d.add_paragraph("""The protection of anticipatory or pre-arrest bail cannot
 be limited to any time-
frame or “fixed period” as denial of bail amounts to deprivation of the funda-
mental right to personal liberty in a free and democratic country, a Consti-
tution Bench of the Supreme Court ruled on Wednesday""")
d.add_paragraph("second paragraph")

ans = Document()  # new document that receives the de-hyphenated paragraphs
for s in d.paragraphs:
    str_list = s.text.split("\n")
    joined = []  # lines of this paragraph after merging hyphenated breaks
    prev = str_list[0]
    for line in str_list[1:]:
        # endswith() is safe on empty lines, unlike prev[-1]
        if prev.endswith("-"):
            # drop the trailing hyphen and glue the next line onto this one
            prev = prev[:-1] + line
        else:
            joined.append(prev)
            prev = line
    joined.append(prev)  # the last (or only) line

    new_para = "\n".join(joined)
    ans.add_paragraph(new_para)
    print(new_para)
ans.save("demo.docx")
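
The joining step above could also be sketched with a regular expression. This is only a minimal sketch: it assumes the paragraph text has already been obtained as a string (for example from `paragraph.text`) and that the line breaks really are present in it as '\n'. The names `join_hyphenated` and `joined_words` are hypothetical helpers, not part of python-docx.

```python
import re

# A word fragment ending in "-" at the end of a line, followed by the
# fragment that starts the next line (illustrative pattern, an assumption).
_HYPHEN_BREAK = re.compile(r"(\w+)-[ \t]*\n[ \t]*(\w+)")


def join_hyphenated(text):
    """Rejoin words that were split by a hyphen at a line break."""
    return _HYPHEN_BREAK.sub(r"\1\2", text)


def joined_words(text):
    """Return only the words that join_hyphenated would glue together."""
    return [head + tail for head, tail in _HYPHEN_BREAK.findall(text)]


sample = (
    "The protection of anticipatory or pre-arrest bail cannot be limited to any time-\n"
    "frame or fixed period as denial of bail amounts to deprivation of the funda-\n"
    "mental right to personal liberty in a free and democratic country, a Consti-\n"
    "tution Bench of the Supreme Court ruled on Wednesday."
)

print(joined_words(sample))  # ['timeframe', 'fundamental', 'Constitution']
```

A hyphen inside a line, such as the one in "pre-arrest", is left alone because the pattern only matches a hyphen directly before a line break. Case is preserved, so "Constitution" keeps its capital letter; apply `.lower()` if lowercase output is wanted.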

Hi. Thanks for your response. However, the input is a .docx file and not a .txt file. The 4 lines that I gave as a sample are one single paragraph and not 4 different lines. Is there a way to do this? — Kathiravan Saimoorthy, Jan 30 '20 at 05:38
@KathiravanSaimoorthy you have to convert your paragraph into text and then you can do all your string manipulation on that. Check this link: https://python-docx.readthedocs.io/en/latest/ — vkSinha, Jan 30 '20 at 06:29

score 0 · Answer 2 · answered Jan 30 '20 at 06:57

I have one solution, whereby you set the number of lines (here, paragraphs) you want to retrieve:

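from docx import Document

# `doc` was not defined in the original snippet; 'f.docx' is an assumed file name
doc = Document('f.docx')

def pp():
    # print the text of the first 20 paragraphs, then stop
    for x, paragraph in enumerate(doc.paragraphs):
        if x >= 20:
            break
        print(paragraph.text)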

This, however, will not be feasible if the number of lines in the document is always different.
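
One way to avoid hard-coding the count might be to pass the limit as an optional argument. The sketch below assumes only that each item has a `.text` attribute, as python-docx paragraphs do; `first_texts` is a hypothetical helper name.

```python
from itertools import islice


def first_texts(paragraphs, limit=None):
    """Return the text of at most `limit` items; all of them when limit is None."""
    # islice stops early at `limit`, or runs to the end when limit is None
    return [p.text for p in islice(paragraphs, limit)]


# With python-docx this might be called as (the file name is a placeholder):
#     first_texts(Document("f.docx").paragraphs, limit=20)
```

Because `islice` simply stops when the input runs out, a document with fewer paragraphs than `limit` is handled without any extra check.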
