
I have multiple Word documents in a directory, and I am using python-docx to clean them up. The script is long, but one small part of it, which you'd think would be the easiest, is not working. After making some edits, I need to remove all line breaks and carriage returns. However, the following code doesn't do the job. I've tried different workarounds, such as using a for loop to iterate over each character, with no results. Yet when I tried doing it manually in Notepad++, \r was easily found and replaced.

def remove_line_breaks(document):
    for paragraph in document.paragraphs:
        paragraph.text = paragraph.text.replace('\r', ' ').replace('\n', ' ')
Kyle F Hartzenberg
Leila
  • This part of the code looks good to me. Could you print out paragraph.text in the for loop to see what you are modifying? – matebende Apr 14 '23 at 05:46
  • How are you saving the file after modifying it? – jsbueno Apr 15 '23 at 04:08
  • It's a large set of documents, but I've tried figuring out the issue by printing part of the text as a sample. It's not clear at all what the issue is. All the breaks are identified in other programs but not by this code. @matebende – Leila Apr 15 '23 at 13:15
  • Documents are saved after calling all functions. I tried saving them separately as well, but again it didn't work. @jsbueno – Leila Apr 15 '23 at 13:16

2 Answers


The code you posted looks like it will, in fact, replace all carriage returns and newline characters with spaces. I agree with the comments above: it would help to see how the program is failing and what the expected behavior is. It is hard to tell what is wrong without seeing the full picture.
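
To see what the loop is actually working with, one option is to collect the `repr()` of each paragraph's text, which makes `\r`, `\n` and other invisible characters visible. This is only a sketch: `describe_paragraphs` is a hypothetical helper name, and the stand-in objects below merely mimic the `.paragraphs` and `.text` attributes of a python-docx document so that the snippet runs without a file.

```python
from types import SimpleNamespace


def describe_paragraphs(document):
    """Return (index, repr(text)) pairs so hidden characters become visible."""
    return [(i, repr(p.text)) for i, p in enumerate(document.paragraphs)]


# Stand-in for a python-docx Document (assumption: only .paragraphs/.text are used).
fake_document = SimpleNamespace(
    paragraphs=[
        SimpleNamespace(text="first line\nsecond line"),
        SimpleNamespace(text="no breaks here"),
    ]
)

for index, shown in describe_paragraphs(fake_document):
    print(index, shown)
```

With a real file you would pass the object returned by `docx.Document(path)` instead. If the output shows no `\r` or `\n` at all, the breaks seen in other programs may be paragraph boundaries rather than characters inside the text, but that is a guess without seeing the documents.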

One thing to note: if these characters (\r, \n) only trail or precede your body of text (as when you read lines from a file), you can use .strip() to achieve the same result. That only works, of course, if they are not embedded in the string you are stripping.
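
A minimal sketch of that difference, using a made-up sample string:

```python
text = "\r\nHello\r\nworld\n"

# strip() only removes whitespace at the two ends of the string.
stripped = text.strip()

# replace() also handles line breaks embedded in the middle.
replaced = text.replace("\r", " ").replace("\n", " ")

print(repr(stripped))
print(repr(replaced))
```

The embedded `\r\n` survives `strip()`, whereas the chained `replace()` calls turn every `\r` and `\n` into a space.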

nulzo
  • It's really not clear why it's failing. No error. Just doesn't do the job. I've tried ``.strip()`` as well. No difference. – Leila Apr 15 '23 at 13:17

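If you iterate over a list using the "in" keyword, the loop variable is a new name bound to each element, so rebinding that name does not change the list. Assigning to an attribute such as paragraph.text does modify the underlying object, though, so indexing should not be strictly necessary. Still, you could try writing through the index explicitly: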

def remove_line_breaks(document):
    for i, paragraph in enumerate(document.paragraphs):
        document.paragraphs[i].text = paragraph.text.replace('\r', ' ').replace('\n', ' ')

I haven't tried it, though.
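
The difference between rebinding the loop variable and assigning to one of its attributes can be checked with a small, self-contained sketch. `Paragraph` here is a hypothetical stand-in class, not the python-docx one.

```python
class Paragraph:
    """Hypothetical stand-in for a paragraph object with a writable .text."""

    def __init__(self, text):
        self.text = text


paragraphs = [Paragraph("a\nb"), Paragraph("c\rd")]

# Assigning to an attribute changes the object the loop variable refers to.
for paragraph in paragraphs:
    paragraph.text = paragraph.text.replace("\r", " ").replace("\n", " ")

print([p.text for p in paragraphs])

words = ["a\nb"]
# Rebinding the loop variable itself leaves the list untouched.
for word in words:
    word = word.replace("\n", " ")

print(words)
```

In this sketch the attribute assignment is visible after the loop, while the rebinding is not, so a plain `for paragraph in ...` loop and the `enumerate` version should update `.text` in the same way.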

IamAino