How to conserve the pdf layout after converting content from English to French using Python

Question

I am working on a simple application that converts my PDF files from English to French. I have a proof of concept that iterates over a given file and translates all of its text into French. I am now stuck on saving the translated French text to a PDF whose structure is similar to the original English version.

import PyPDF2
from googletrans import Translator
translator = Translator()

read_pdf = PyPDF2.PdfFileReader(open('any_english.pdf', 'rb'))
write_pdf = PyPDF2.PdfFileWriter()
number_of_pages = read_pdf.getNumPages()

for i in range(number_of_pages):
    page = read_pdf.getPage(i)
    page_content = page.extractText()
    print(translator.translate(page_content, dest='fr').text)

    # TODO: save the translated French text to a PDF that keeps the structure of the original

Note: all content in the PDF is text, not images.

The .extractText() method strips any formatting information about the page, and doesn't even guarantee you get the text back in any correct "order", as far as I know. You'll be unable to recreate the page's structure and format with this method. I don't know of a way to do what you're looking to do with this library. — Daniel Harms, Mar 02 '18 at 13:13

score 3 · Accepted Answer · answered Mar 08 '18 at 10:37

There is no easy way to open, edit, and rewrite PDFs in Python. However, depending on the complexity and structure of the PDF, you might have success converting the PDF to HTML, translating it, and then generating a PDF from the HTML.

For converting PDF to HTML, there is pdf2html which has a basic Python wrapper.

Once the translation is done, you can reverse this process, with varying degrees of success, using e.g. weasyprint, html2pdf (Mac only), or wkhtmltopdf (requires Qt).
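The translation step in the middle is where the layout is easiest to break: if the whole HTML file is passed to a translator, tags and CSS can be mangled. Below is a minimal sketch, using only the standard library, that translates text nodes and leaves the markup untouched. The names `TextNodeTranslator` and `translate_html` are hypothetical, and `translate` is an assumed callable (for example, a small wrapper around `translator.translate(text, dest='fr').text`); this is not part of any of the tools mentioned above.

```python
from html import escape
from html.parser import HTMLParser


class TextNodeTranslator(HTMLParser):
    """Rebuild an HTML document, passing only text nodes through `translate`."""

    def __init__(self, translate):
        super().__init__(convert_charrefs=True)
        self.translate = translate  # assumed callable: str -> str
        self.parts = []
        self.skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        # Re-emit the tag exactly as it appeared in the source.
        self.parts.append(self.get_starttag_text())
        if tag in ("script", "style"):
            self.skip_depth += 1

    def handle_startendtag(self, tag, attrs):
        self.parts.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self.skip_depth:
            self.skip_depth -= 1
        self.parts.append(f"</{tag}>")

    def handle_data(self, data):
        if self.skip_depth or not data.strip():
            # CSS, JavaScript and pure whitespace are copied verbatim.
            self.parts.append(data)
        else:
            self.parts.append(escape(self.translate(data), quote=False))

    def handle_comment(self, data):
        self.parts.append(f"<!--{data}-->")

    def handle_decl(self, decl):
        self.parts.append(f"<!{decl}>")


def translate_html(html_text, translate):
    """Return html_text with every visible text node translated."""
    parser = TextNodeTranslator(translate)
    parser.feed(html_text)
    parser.close()  # flush any buffered trailing text
    return "".join(parser.parts)
```

For a quick check without a network call, `translate_html("<p>Hello <b>world</b></p>", str.upper)` exercises the same path with `str.upper` standing in for the translator. Note that translating each text node separately loses sentence context when a sentence is split across several elements, so translation quality may suffer on heavily fragmented HTML.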

mfitzp


I am searching for an example. – Bastin Robin Mar 09 '18 at 07:08
@BastinRobin thanks for accepting — did you manage to make it work for you? Let me know if you need some more help in getting a working example up and running. – mfitzp Mar 09 '18 at 19:58
can you help me with an example? @mfitzp – Bastin Robin Mar 15 '18 at 04:16
@BastinRobin can't give full example now (away for a few days) but generally: use subprocess module to run external pdf2html to generate html. Open the html and run translator.translate over it, save to a new file. Then weasyprint using subprocess to generate the pdf output (new file). – mfitzp Mar 15 '18 at 09:14
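The steps in the last comment can be sketched roughly as follows. This is an outline under stated assumptions, not a tested end-to-end solution: it assumes the converter is installed as the `pdf2htmlEX` executable and that the `weasyprint` command-line tool is on the PATH. The function names, the `work` directory, and the `translate_html` parameter (any callable that takes an HTML string and returns the translated HTML) are hypothetical.

```python
import subprocess
from pathlib import Path


def pdf_to_html_cmd(pdf_path, out_dir, html_name):
    # Assumes the pdf2htmlEX CLI; --dest-dir sets the output directory.
    return ["pdf2htmlEX", "--dest-dir", str(out_dir), str(pdf_path), html_name]


def html_to_pdf_cmd(html_path, pdf_path):
    # Assumes the weasyprint CLI: weasyprint <input.html> <output.pdf>
    return ["weasyprint", str(html_path), str(pdf_path)]


def translate_pdf(pdf_path, out_pdf, translate_html, work_dir="work"):
    """PDF -> HTML -> translated HTML -> PDF, using external tools."""
    work = Path(work_dir)
    work.mkdir(parents=True, exist_ok=True)

    # 1. Convert the source PDF to HTML with the external converter.
    subprocess.run(pdf_to_html_cmd(pdf_path, work, "source.html"), check=True)

    # 2. Translate the HTML and save it to a new file.
    html_text = (work / "source.html").read_text(encoding="utf-8")
    translated = translate_html(html_text)  # hypothetical callable
    (work / "translated.html").write_text(translated, encoding="utf-8")

    # 3. Render the translated HTML back to PDF.
    subprocess.run(html_to_pdf_cmd(work / "translated.html", out_pdf), check=True)
```

With `check=True`, a failing external command raises `subprocess.CalledProcessError` instead of silently producing a missing or empty file. How closely the output matches the original depends on how well the renderer handles the generated HTML and CSS, and French text is often longer than the English source, so some overflow is to be expected.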

score 1 · Answer 2 · answered Mar 08 '18 at 10:41

Basically, you can't directly create a PDF file in a specific format. But you can try writing your data in XHTML format and then converting it to .pdf using xhtml2pdf. I hope this helps with your requirement.
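A minimal sketch of that idea, assuming the translated text is already available as a list of paragraphs. The helper names `paragraphs_to_xhtml` and `write_pdf` are hypothetical; `pisa.CreatePDF` is the xhtml2pdf entry point, imported lazily so the markup helper works even without the package installed. This only produces plain paragraphs: reproducing the original layout would require writing XHTML and CSS that mirror it.

```python
from html import escape


def paragraphs_to_xhtml(paragraphs, lang="fr"):
    """Wrap translated paragraphs in a very small XHTML document."""
    body = "\n".join(f"<p>{escape(p)}</p>" for p in paragraphs)
    return (
        f'<html lang="{lang}"><head><meta charset="utf-8"/></head>\n'
        f"<body>\n{body}\n</body></html>"
    )


def write_pdf(xhtml, out_path):
    """Render an XHTML string to a PDF file; return True on success."""
    from xhtml2pdf import pisa  # third-party package, imported lazily

    with open(out_path, "wb") as out_file:
        status = pisa.CreatePDF(xhtml, dest=out_file)
    return not status.err
```

Usage would look like `write_pdf(paragraphs_to_xhtml(["Bonjour le monde"]), "french.pdf")`.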

Induprasad


score -1 · Answer 3 · answered Jan 30 '20 at 17:01

You can use textract:

import textract
text = textract.process('path/to/a.pdf', language='fr')

By default, it preserves the layout.

Henrique


If it did, it apparently no longer does. – mathtick Aug 12 '23 at 14:22
