-1

I am working on a simple application that converts my PDF files containing English text into PDFs containing the French translation. I have a proof of concept that iterates over the pages of a given file and translates all the text into French. Now I am stuck on saving the translated French text into a PDF with a structure similar to the original English version.

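import PyPDF2
from googletrans import Translator

translator = Translator()

# PdfFileReader, getPage and extractText are the pre-3.0 PyPDF2 names.
# The with-block closes the source file once all pages are processed.
with open('any_english.pdf', 'rb') as pdf_file:
    read_pdf = PyPDF2.PdfFileReader(pdf_file)
    write_pdf = PyPDF2.PdfFileWriter()
    number_of_pages = read_pdf.getNumPages()

    for i in range(number_of_pages):
        page = read_pdf.getPage(i)
        # extractText() returns plain text only, without layout information.
        page_content = page.extractText()
        print(translator.translate(page_content, dest='fr').text)

        # TODO: save the translated French text into a PDF that keeps the structure of the original PDF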

Note: all content in the PDFs is text, not images.

Bastin Robin
  • The .extractText() method strips any formatting information about the page, and doesn't even guarantee you get the text back in any correct "order", as far as I know. You'll be unable to recreate the page's structure and format with this method. I don't know of a way to do what you're looking to do with this library. – Daniel Harms Mar 02 '18 at 13:13
  • Any other methods to accomplish this task? @DanielHarms – Bastin Robin Mar 03 '18 at 01:17

3 Answers

3

There is no easy way to open, edit, and rewrite PDFs in Python. However, depending on the complexity of the PDF's structure, you might have success converting the PDF to HTML, translating the HTML, and then generating a PDF from the translated HTML.

For converting PDF to HTML, there is pdf2html, which has a basic Python wrapper.

Once the translation is done, you can reverse this process with varying degrees of success using, for example, weasyprint, html2pdf (Mac only), or wkhtmltopdf (requires Qt).
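The round trip described above could be sketched roughly as follows. This is only a sketch under several assumptions: that the `pdf2htmlEX` and `weasyprint` command-line tools are installed and on the PATH (the command name `pdf2htmlEX` is my guess at the tool meant by "pdf2html"), and that `translate` is any function mapping a string to its translation. The names `TextNodeTranslator`, `translate_html` and `translate_pdf` are made up for illustration. Only text nodes are sent to the translator, so tags, attributes, and the contents of script and style elements pass through unchanged.

```python
import os
import subprocess
import tempfile
from html.parser import HTMLParser


class TextNodeTranslator(HTMLParser):
    """Re-emit HTML as parsed, passing only visible text nodes through `translate`."""

    SKIP = {"script", "style"}  # contents of these elements are never translated

    def __init__(self, translate):
        super().__init__(convert_charrefs=False)
        self.translate = translate
        self.out = []
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.skip_depth += 1
        self.out.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag in self.SKIP and self.skip_depth:
            self.skip_depth -= 1
        self.out.append("</%s>" % tag)

    def handle_data(self, data):
        text = data.strip()
        if self.skip_depth or not text:
            self.out.append(data)
            return
        # Keep the surrounding whitespace so inline spacing survives translation.
        lead = data[: len(data) - len(data.lstrip())]
        trail = data[len(data.rstrip()):]
        self.out.append(lead + self.translate(text) + trail)

    def handle_entityref(self, name):
        self.out.append("&%s;" % name)

    def handle_charref(self, name):
        self.out.append("&#%s;" % name)

    def handle_comment(self, data):
        self.out.append("<!--%s-->" % data)

    def handle_decl(self, decl):
        self.out.append("<!%s>" % decl)


def translate_html(html, translate):
    """Return `html` with every text node replaced by translate(text)."""
    parser = TextNodeTranslator(translate)
    parser.feed(html)
    parser.close()
    return "".join(parser.out)


def translate_pdf(src_pdf, dst_pdf, translate):
    """PDF -> HTML -> translated HTML -> PDF, using external command-line tools."""
    src_pdf, dst_pdf = os.path.abspath(src_pdf), os.path.abspath(dst_pdf)
    with tempfile.TemporaryDirectory() as tmp:
        # Assumption: pdf2htmlEX writes the named HTML file into its working directory.
        subprocess.run(["pdf2htmlEX", src_pdf, "source.html"], cwd=tmp, check=True)
        with open(os.path.join(tmp, "source.html"), encoding="utf-8") as fh:
            html = fh.read()
        out_html = os.path.join(tmp, "translated.html")
        with open(out_html, "w", encoding="utf-8") as fh:
            fh.write(translate_html(html, translate))
        # Assumption: the weasyprint CLI takes the input HTML and the output PDF path.
        subprocess.run(["weasyprint", out_html, dst_pdf], check=True)
```

With the `translator` object from the question, a call might look like `translate_pdf('any_english.pdf', 'any_french.pdf', lambda s: translator.translate(s, dest='fr').text)`. Translating node by node keeps the markup intact but loses sentence context when a sentence is split across several elements, so the quality of the result will depend on how the converter lays out the HTML.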

mfitzp
  • I am searching for an example. – Bastin Robin Mar 09 '18 at 07:08
  • @BastinRobin thanks for accepting. Did you manage to make it work for you? Let me know if you need some more help in getting a working example up and running. – mfitzp Mar 09 '18 at 19:58
  • can you help me with an example? @mfitzp – Bastin Robin Mar 15 '18 at 04:16
  • @BastinRobin can't give full example now (away for a few days) but generally: use subprocess module to run external pdf2html to generate html. Open the html and run translator.translate over it, save to a new file. Then weasyprint using subprocess to generate the pdf output (new file). – mfitzp Mar 15 '18 at 09:14
1

Basically, you can't directly create a PDF file with a specific layout. But you can try writing your data as XHTML and then converting it to PDF using xhtml2pdf. I hope this helps with your requirement.
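A minimal sketch of that idea, assuming the translated text is already available as a list of paragraph strings: `build_xhtml` and `write_pdf` are hypothetical helper names, and the only xhtml2pdf call used is `pisa.CreatePDF`, as shown in that library's usage examples. The markup here is deliberately simple and does not reproduce the layout of the original PDF.

```python
from html import escape


def build_xhtml(paragraphs, title="Translated document"):
    """Wrap plain-text paragraphs in a simple XHTML page, escaping special characters."""
    body = "\n".join("<p>%s</p>" % escape(p) for p in paragraphs)
    return (
        '<html><head><meta charset="utf-8"/><title>%s</title></head>\n'
        "<body>\n%s\n</body></html>" % (escape(title), body)
    )


def write_pdf(xhtml, out_path):
    """Render the XHTML string to a PDF file; returns True if no errors were reported."""
    # Third-party dependency, imported here so build_xhtml works without it.
    from xhtml2pdf import pisa

    with open(out_path, "w+b") as fh:
        status = pisa.CreatePDF(xhtml, dest=fh)
    return not status.err
```

For example, `write_pdf(build_xhtml(["Bonjour", "Au revoir"]), "any_french.pdf")` would produce a two-paragraph PDF.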

Induprasad
-1

You can use textract:

import textract
text = textract.process('path/to/a.pdf', language='fr')

By default, it preserves the layout.

Henrique