
I am using pdfparser to extract text from PDF files, but some PDF files are copy-protected or use different fonts, and pdfparser does not work for those. Is it possible to get text from a copy-protected PDF?

This is my code:

<?php
// Show all errors while debugging.
error_reporting(E_ALL);
ini_set('display_errors', 1);

// Include the Composer autoloader if not already done.
require 'vendor/autoload.php';

// Parse the PDF file and build the necessary objects.
$parser = new \Smalot\PdfParser\Parser();
$pdf    = $parser->parseFile('tests.pdf');

// Retrieve all pages from the PDF file.
$pages  = $pdf->getPages();

// Loop over each page to extract its text.
// utf8_encode() is dropped: it is deprecated since PHP 8.2 and assumes ISO-8859-1 input.
foreach ($pages as $page) {
    echo $page->getText();
}

?>

This code produces no errors or warnings; it only outputs blank space. I have also tried UTF-8 encoding, but it still does not work.

Dave
V.p. Dixit

2 Answers


If the author of the PDF set the document's permission flags to disallow copying or extracting text and graphics, you should take that into consideration. However, not all PDF software respects such restrictions.
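One way to see which permission flags a document carries is to read the `Encrypted:` line printed by `pdfinfo`, a tool shipped alongside `pdftotext` in poppler. The sketch below only parses that line; the helper name `parseEncryptedLine` is hypothetical, and the sample format (`yes (print:yes copy:no ...)`) is assumed from typical `pdfinfo` output and may vary between versions.

```php
<?php
/**
 * Hypothetical helper: parse the "Encrypted:" line from pdfinfo output.
 * Returns null when no such line is present.
 */
function parseEncryptedLine(string $pdfinfoOutput): ?array
{
    // Find the line, e.g. "Encrypted:  yes (print:yes copy:no change:no ...)".
    if (!preg_match('/^Encrypted:\s+(yes|no)\b(.*)$/mi', $pdfinfoOutput, $m)) {
        return null;
    }

    // Collect the name:yes/no pairs; entries such as "algorithm:AES" are skipped.
    $permissions = [];
    preg_match_all('/(\w+):(yes|no)\b/', $m[2], $pairs, PREG_SET_ORDER);
    foreach ($pairs as $pair) {
        $permissions[$pair[1]] = ($pair[2] === 'yes');
    }

    return [
        'encrypted'   => strtolower($m[1]) === 'yes',
        'permissions' => $permissions,
    ];
}
```

It could be fed with something like `parseEncryptedLine((string) shell_exec('pdfinfo ' . escapeshellarg($file)))`, assuming `pdfinfo` is installed and on the PATH. A result with `copy` set to `false` would indicate that the author disallowed text extraction.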

JosephA
  • Another question I want to ask: why does the browser not show math equations that I get from the PDF? – V.p. Dixit May 21 '19 at 10:36
  • Not all PDF software is created equal. I'd compare with multiple PDF viewers and check whether the content displays as expected. – JosephA May 22 '19 at 02:01
  • Some PDFs may be password protected, not to forbid extraction but to prevent modification. Example: my bank account statements. It is, however, perfectly permissible to read them without a password and to copy and paste from them. – GingkoFr Jan 08 '23 at 09:20

\Smalot\PdfParser can't extract text from password-protected files.

I've found a far better solution for that (provided your PHP service is running on a Linux server): use the command line tool “pdftotext” (included in the “poppler-utils” package in, for example, Debian or Ubuntu).

It handles password-protected files well (it has an option to supply the password if required).

It can be used like this inside a PHP script running under a web server on Linux, with the PDF file submitted through a web form:

// $filepath is the full file path properly extracted from the $_FILES variable
// after form submission.
// Assumes Linux + Apache + PHP; other setups may need adjusting.

if (! file_exists($filepath)) {
    // In case the systemd private temporary directory feature is active.
    $filepath = '/proc/'.posix_getppid().'/root'.$filepath;
}

$cwdt = 4;  // character width for -fixed; tune it for better column alignment

// “sudo” is mostly needed with the systemd private temporary directory
// feature. It requires a proper sudoers configuration, of course.
// escapeshellarg() keeps spaces and shell metacharacters in the path
// from breaking (or hijacking) the command.
$cmd = "sudo /usr/bin/pdftotext -nopgbrk -fixed {$cwdt} ".escapeshellarg($filepath)." -";

// $output receives the extracted text line by line; $res is the exit status.
exec($cmd, $output, $res);

if ($res !== 0) {
    echo "pdftotext failed with exit status {$res}\n";
}

print_r($output);

However, I don't know whether this addresses the “or have different fonts” part of the question.
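For the password option mentioned above, `pdftotext` accepts `-upw` (user password) and `-opw` (owner password). A minimal sketch of building such a command follows; the helper name `buildPdftotextCommand` is hypothetical, and it assumes `pdftotext` is on the PATH and that the user password is the one needed.

```php
<?php
/**
 * Hypothetical helper: build a pdftotext command line that writes text to
 * stdout, optionally passing a user password via -upw.
 */
function buildPdftotextCommand(string $pdfPath, ?string $userPassword = null): string
{
    $parts = ['pdftotext', '-nopgbrk'];

    if ($userPassword !== null) {
        // Quote the password so the shell passes it through unchanged.
        $parts[] = '-upw';
        $parts[] = escapeshellarg($userPassword);
    }

    $parts[] = escapeshellarg($pdfPath);
    $parts[] = '-';  // "-" sends the extracted text to stdout

    return implode(' ', $parts);
}
```

The result can be passed to `exec()` in the same way as `$cmd` in the script above. Note that a password given on the command line may be visible to other users in the process list, so this suits trusted servers best.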

GingkoFr