I have a question about splitting PDF files. I have a collection of PDF files that I want to split by paragraph, so that each paragraph of a PDF becomes a file of its own. I would appreciate any help with this, preferably in Python, but if that is not possible, any language will do.

LoniF
  • What are you planning to use with python for extracting the text from PDF? pdf2text can also be used. – Radan Feb 07 '17 at 15:31
  • I am currently writing a program that uses a subprocess call to parse a PDF using pdftotext. It's pretty useful: https://en.wikipedia.org/wiki/Pdftotext – Steampunkery Feb 07 '17 at 15:40
  • @Radan I want to compute the similarity between paragraphs. All the PDF files consist of multiple paragraphs, and I want to see how similar the paragraphs are to each other. But first I need to split the PDF files into paragraphs. – LoniF Feb 08 '17 at 15:16
  • You lose a lot of information by going straight to text, and the conversion has many parameters whose specifics depend on the package you're using. If you choose to access the PDF structure instead, I found pymupdf to be a great option. Here's a post that explains how to use the structure to get more information out of the extraction process: https://towardsdatascience.com/extracting-headers-and-paragraphs-from-pdf-using-pymupdf-676e8421c467 – Veltzer Doron Oct 02 '20 at 06:16

1 Answer

You can use pdftotext for this and wrap it in a Python subprocess call. Alternatively, you could use a library that already does this for you, such as textract. Here is a quick example. Note: I have used a run of four or more whitespace characters as the delimiter to split the text into a list of paragraphs; you might want to use a different technique.

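import re
import textract

# Read the content of the PDF as text (textract.process returns bytes, so decode it)
text = textract.process('file_name.pdf').decode('utf-8')
# Use a run of four or more whitespace characters as the paragraph delimiter
# to split the text into a list of paragraphs
paragraphs = re.split(r'\s{4,}', text)
print(paragraphs)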
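
For the pdftotext route, here is a minimal sketch of how the subprocess call and the "one file per paragraph" step might look. It assumes the pdftotext command-line tool (part of poppler) is installed and on the PATH, and that paragraphs are separated by blank lines in its output, which will not hold for every PDF. The function names pdf_to_text, split_paragraphs and write_paragraphs, and the output naming scheme, are made up for illustration.

```python
import re
import subprocess
from pathlib import Path


def pdf_to_text(pdf_path):
    """Run the pdftotext CLI and return the extracted text.

    Passing "-" as the output file makes pdftotext write to stdout.
    Assumes pdftotext is installed and on the PATH.
    """
    result = subprocess.run(
        ["pdftotext", str(pdf_path), "-"],
        capture_output=True,
        check=True,
    )
    return result.stdout.decode("utf-8", errors="replace")


def split_paragraphs(text):
    """Split text into paragraphs on blank lines (an assumed delimiter).

    Form feeds, which pdftotext emits at page breaks, are also treated
    as paragraph breaks. Line breaks inside a paragraph are collapsed.
    """
    parts = re.split(r"\n\s*\n|\f", text)
    return [" ".join(part.split()) for part in parts if part.strip()]


def write_paragraphs(paragraphs, out_dir, stem):
    """Write each paragraph to out_dir/<stem>_<index>.txt; return the paths."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    for index, paragraph in enumerate(paragraphs, start=1):
        path = out_dir / f"{stem}_{index:03d}.txt"
        path.write_text(paragraph + "\n", encoding="utf-8")
        paths.append(path)
    return paths
```

Usage would be something like write_paragraphs(split_paragraphs(pdf_to_text('file_name.pdf')), 'paragraphs', 'file_name'). The blank-line rule is only a starting point; depending on how the PDFs are laid out, a different delimiter may work better.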
Radan
  • You will also need to install poppler from https://blog.alivate.com.au/poppler-windows/ – Chad Feb 11 '23 at 10:34