I am using pypandoc to convert docx files to txt:

import pypandoc

f = 'some file.docx'
o = pypandoc.convert_file(f, 'plain', outputfile='file.txt')
assert o == '', o  # convert_file returns an empty string when outputfile is set

The problem is that the output is formatted for visual readability: the text in table columns is wrapped, so it can't be read programmatically.

For example, the word "similar" is split: "s" comes first, followed by spaces and the words from the other columns, and the remainder "imilar" appears on the next line, like this:

|s |words|words|

|imilar|words|words|

So it is impossible to read the word "similar" programmatically.

I need a result like the one MS Word produces when saving a docx as txt: non-wrapped text. Unfortunately, I am limited in my choice of Python libraries.

Is it possible to turn off word wrapping in pypandoc.convert_file?

SergL

1 Answer

You can pass the extra argument --wrap=none to pandoc:

extra_args=('--standalone','--wrap=none')

so the full call looks like this:

pypandoc.convert_file(f, 'plain', extra_args=('--standalone', '--wrap=none'), outputfile='file.txt')

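
For completeness, here is a minimal self-contained sketch of the same idea. The helper names build_no_wrap_args and docx_to_plain are hypothetical, not part of pypandoc. The optional --columns argument is an assumption on my side: the pandoc manual says it also affects the column-width calculation for plain-text tables, so a large value may help if table cells still wrap with --wrap=none alone, but the exact behaviour can depend on the pandoc version.

```python
def build_no_wrap_args(columns=None):
    """Return a list of pandoc arguments that disable line wrapping.

    `columns` is optional; when given, it is passed as --columns=N,
    which pandoc also uses when computing plain-text table widths.
    """
    args = ["--wrap=none"]
    if columns is not None:
        args.append(f"--columns={columns}")
    return args


def docx_to_plain(source, target, columns=None):
    """Convert a docx file to plain text without line wrapping.

    Hypothetical wrapper around pypandoc.convert_file; requires both
    pypandoc and a pandoc binary to be installed.
    """
    import pypandoc  # imported lazily so the arg builder works without it

    return pypandoc.convert_file(
        source,
        "plain",
        extra_args=build_no_wrap_args(columns),
        outputfile=target,
    )
```

Usage would be something like docx_to_plain('some file.docx', 'file.txt'), or docx_to_plain('some file.docx', 'file.txt', columns=1000) if the table cells are still wrapped.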
Narish