I installed PdfParser with Composer, and it works when I open the page cron.php: the PDF is parsed.

This is my code in cron.php:

include 'vendor/autoload.php';
//include  $_SERVER["DOCUMENT_ROOT"]. '/vendor/autoload.php';
//require 'vendor/autoload.php';
$parser = new \Smalot\PdfParser\Parser();
$pdf    = $parser->parseFile("$path/$fname");
$text   = $pdf->getText();
$pdf    = $parser->parseFile("vendor/smalot/pdfparser/samples/1.pdf");
$text   = $pdf->getText();
echo $text;
exit();

I set up a cron job on an Ubuntu 16 server to run cron.php with this entry:

 * * * * * /usr/bin/php -q /var/www/html/..../public_html/post/cron.php >>/var/www/html/..../public_html/post/log/cron.php.log 2>&1

The page works, but the log reports this:

Fatal error:  Uncaught Error: Class 'Smalot\PdfParser\Parser' not found in /var/www/html/..../public_html/post/cron.php:161
Stack trace:
#0 /var/www/html/..../public_html/post/cron.php(62): getpart(Resource id #8, 451, Object(stdClass), 2)
#1 /var/www/html/..../public_html/post/cron.php(378): getmsg(Resource id #8, 451)
#2 {main}
  thrown in /var/www/html/..../public_html/post/cron.php on line 161

This is my autoload.php:

<?php
/*
Using PDFParser without Composer
Folder structure
================
webroot
  pdfdemos
    INV001.pdf # test PDF file to extract text from for demo
    test.php # our operational demo file
  vendor
    autoload.php
    tecnickcom
      tcpdf # unpack v6.2.12 from release at https://github.com/tecnickcom/TCPDF/archive/6.2.12.tar.gz
    smalot
      pdfparser # unpack from git master https://github.com/smalot/pdfparser/archive/master.zip release is 0.9.25 dated 2015-09-15
        docs # optional
        samples # optional
        src
          Smalot
            PdfParser
*/

$vendorDir = 'vendor';
//$vendorDir = $_SERVER["DOCUMENT_ROOT"] . '/vendor';
$tcpdf_files = Array(
    'Datamatrix' => $vendorDir . '/tecnickcom/tcpdf/include/barcodes/datamatrix.php',
    'PDF417' => $vendorDir . '/tecnickcom/tcpdf/include/barcodes/pdf417.php',
    'QRcode' => $vendorDir . '/tecnickcom/tcpdf/include/barcodes/qrcode.php',
    'TCPDF' => $vendorDir . '/tecnickcom/tcpdf/tcpdf.php',
    'TCPDF2DBarcode' => $vendorDir . '/tecnickcom/tcpdf/tcpdf_barcodes_2d.php',
    'TCPDFBarcode' => $vendorDir . '/tecnickcom/tcpdf/tcpdf_barcodes_1d.php',
    'TCPDF_COLORS' => $vendorDir . '/tecnickcom/tcpdf/include/tcpdf_colors.php',
    'TCPDF_FILTERS' => $vendorDir . '/tecnickcom/tcpdf/include/tcpdf_filters.php',
    'TCPDF_FONTS' => $vendorDir . '/tecnickcom/tcpdf/include/tcpdf_fonts.php',
    'TCPDF_FONT_DATA' => $vendorDir . '/tecnickcom/tcpdf/include/tcpdf_font_data.php',
    'TCPDF_IMAGES' => $vendorDir . '/tecnickcom/tcpdf/include/tcpdf_images.php',
    'TCPDF_IMPORT' => $vendorDir . '/tecnickcom/tcpdf/tcpdf_import.php',
    'TCPDF_PARSER' => $vendorDir . '/tecnickcom/tcpdf/tcpdf_parser.php',
    'TCPDF_STATIC' => $vendorDir . '/tecnickcom/tcpdf/include/tcpdf_static.php'
);

foreach ($tcpdf_files as $key => $file) {
    include_once $file;
}

include_once  $vendorDir . "/smalot/pdfparser/src/Smalot/PdfParser/Parser.php";
include_once  $vendorDir . "/smalot/pdfparser/src/Smalot/PdfParser/Document.php";
include_once  $vendorDir . "/smalot/pdfparser/src/Smalot/PdfParser/Header.php";
include_once  $vendorDir . "/smalot/pdfparser/src/Smalot/PdfParser/PDFObject.php";
include_once  $vendorDir . "/smalot/pdfparser/src/Smalot/PdfParser/Element.php";
include_once  $vendorDir . "/smalot/pdfparser/src/Smalot/PdfParser/Encoding.php";
include_once  $vendorDir . "/smalot/pdfparser/src/Smalot/PdfParser/Font.php";
include_once  $vendorDir . "/smalot/pdfparser/src/Smalot/PdfParser/Page.php";
include_once  $vendorDir . "/smalot/pdfparser/src/Smalot/PdfParser/Pages.php";
include_once  $vendorDir . "/smalot/pdfparser/src/Smalot/PdfParser/Element/ElementArray.php";
include_once  $vendorDir . "/smalot/pdfparser/src/Smalot/PdfParser/Element/ElementBoolean.php";
include_once  $vendorDir . "/smalot/pdfparser/src/Smalot/PdfParser/Element/ElementString.php";
include_once  $vendorDir . "/smalot/pdfparser/src/Smalot/PdfParser/Element/ElementDate.php";
include_once  $vendorDir . "/smalot/pdfparser/src/Smalot/PdfParser/Element/ElementHexa.php";
include_once  $vendorDir . "/smalot/pdfparser/src/Smalot/PdfParser/Element/ElementMissing.php";
include_once  $vendorDir . "/smalot/pdfparser/src/Smalot/PdfParser/Element/ElementName.php";
include_once  $vendorDir . "/smalot/pdfparser/src/Smalot/PdfParser/Element/ElementNull.php";
include_once  $vendorDir . "/smalot/pdfparser/src/Smalot/PdfParser/Element/ElementNumeric.php";
include_once  $vendorDir . "/smalot/pdfparser/src/Smalot/PdfParser/Element/ElementStruct.php";
include_once  $vendorDir . "/smalot/pdfparser/src/Smalot/PdfParser/Element/ElementXRef.php";

include_once  $vendorDir . "/smalot/pdfparser/src/Smalot/PdfParser/Encoding/StandardEncoding.php";
include_once  $vendorDir . "/smalot/pdfparser/src/Smalot/PdfParser/Encoding/ISOLatin1Encoding.php";
include_once  $vendorDir . "/smalot/pdfparser/src/Smalot/PdfParser/Encoding/ISOLatin9Encoding.php";
include_once  $vendorDir . "/smalot/pdfparser/src/Smalot/PdfParser/Encoding/MacRomanEncoding.php";
include_once  $vendorDir . "/smalot/pdfparser/src/Smalot/PdfParser/Encoding/WinAnsiEncoding.php";
include_once  $vendorDir . "/smalot/pdfparser/src/Smalot/PdfParser/Font/FontCIDFontType0.php";
include_once  $vendorDir . "/smalot/pdfparser/src/Smalot/PdfParser/Font/FontCIDFontType2.php";
include_once  $vendorDir . "/smalot/pdfparser/src/Smalot/PdfParser/Font/FontTrueType.php";
include_once  $vendorDir . "/smalot/pdfparser/src/Smalot/PdfParser/Font/FontType0.php";
include_once  $vendorDir . "/smalot/pdfparser/src/Smalot/PdfParser/Font/FontType1.php";
include_once  $vendorDir . "/smalot/pdfparser/src/Smalot/PdfParser/XObject/Form.php";
include_once  $vendorDir . "/smalot/pdfparser/src/Smalot/PdfParser/XObject/Image.php";

And this is the file that defines the class the log reports as missing, public_html/post/vendor/smalot/pdfparser/src/Smalot/PdfParser/Parser.php:

<?php

/**
 * @file
 *          This file is part of the PdfParser library.
 *
 * @author  Sébastien MALOT <sebastien@malot.fr>
 * @date    2017-01-03
 * @license LGPLv3
 * @url     <https://github.com/smalot/pdfparser>
 *
 *  PdfParser is a pdf library written in PHP, extraction oriented.
 *  Copyright (C) 2017 - Sébastien MALOT <sebastien@malot.fr>
 *
 *  This program is free software: you can redistribute it and/or modify
 *  it under the terms of the GNU Lesser General Public License as published by
 *  the Free Software Foundation, either version 3 of the License, or
 *  (at your option) any later version.
 *
 *  This program is distributed in the hope that it will be useful,
 *  but WITHOUT ANY WARRANTY; without even the implied warranty of
 *  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
 *  GNU Lesser General Public License for more details.
 *
 *  You should have received a copy of the GNU Lesser General Public License
 *  along with this program.
 *  If not, see <http://www.pdfparser.org/sites/default/LICENSE.txt>.
 *
 */

namespace Smalot\PdfParser;

use Smalot\PdfParser\Element\ElementArray;
use Smalot\PdfParser\Element\ElementBoolean;
use Smalot\PdfParser\Element\ElementDate;
use Smalot\PdfParser\Element\ElementHexa;
use Smalot\PdfParser\Element\ElementName;
use Smalot\PdfParser\Element\ElementNull;
use Smalot\PdfParser\Element\ElementNumeric;
use Smalot\PdfParser\Element\ElementString;
use Smalot\PdfParser\Element\ElementXRef;

/**
 * Class Parser
 *
 * @package Smalot\PdfParser
 */
class Parser
{
    /**
     * @var PDFObject[]
     */
    protected $objects = array();

    /**
     *
     */
    public function __construct()
    {

    }

    /**
     * @param $filename
     * @return Document
     * @throws \Exception
     */
    public function parseFile($filename)
    {
        $content = file_get_contents($filename);
        /*
         * 2018/06/20 @doganoo as multiple times a
         * users have complained that the parseFile()
         * method dies silently, it is an better option
         * to remove the error control operator (@) and
         * let the users know that the method throws an exception
         * by adding @throws tag to PHPDoc.
         *
         * See here for an example: https://github.com/smalot/pdfparser/issues/204
         */
        return $this->parseContent($content);
    }

    /**
     * @param $content
     * @return Document
     * @throws \Exception
     */
    public function parseContent($content)
    {
        // Create structure using TCPDF Parser.
        ob_start();
        @$parser = new \TCPDF_PARSER(ltrim($content));
        list($xref, $data) = $parser->getParsedData();
        unset($parser);
        ob_end_clean();

        if (isset($xref['trailer']['encrypt'])) {
            throw new \Exception('Secured pdf file are currently not supported.');
        }

        if (empty($data)) {
            throw new \Exception('Object list not found. Possible secured file.');
        }

        // Create destination object.
        $document      = new Document();
        $this->objects = array();

        foreach ($data as $id => $structure) {
            $this->parseObject($id, $structure, $document);
            unset($data[$id]);
        }

        $document->setTrailer($this->parseTrailer($xref['trailer'], $document));
        $document->setObjects($this->objects);

        return $document;
    }

    protected function parseTrailer($structure, $document)
    {
        $trailer = array();

        foreach ($structure as $name => $values) {
            $name = ucfirst($name);

            if (is_numeric($values)) {
                $trailer[$name] = new ElementNumeric($values, $document);
            } elseif (is_array($values)) {
                $value          = $this->parseTrailer($values, null);
                $trailer[$name] = new ElementArray($value, null);
            } elseif (strpos($values, '_') !== false) {
                $trailer[$name] = new ElementXRef($values, $document);
            } else {
                $trailer[$name] = $this->parseHeaderElement('(', $values, $document);
            }
        }

        return new Header($trailer, $document);
    }

    /**
     * @param string   $id
     * @param array    $structure
     * @param Document $document
     */
    protected function parseObject($id, $structure, $document)
    {
        $header  = new Header(array(), $document);
        $content = '';

        foreach ($structure as $position => $part) {
            switch ($part[0]) {
                case '[':
                    $elements = array();

                    foreach ($part[1] as $sub_element) {
                        $sub_type   = $sub_element[0];
                        $sub_value  = $sub_element[1];
                        $elements[] = $this->parseHeaderElement($sub_type, $sub_value, $document);
                    }

                    $header = new Header($elements, $document);
                    break;

                case '<<':
                    $header = $this->parseHeader($part[1], $document);
                    break;

                case 'stream':
                    $content = isset($part[3][0]) ? $part[3][0] : $part[1];

                    if ($header->get('Type')->equals('ObjStm')) {
                        $match = array();

                        // Split xrefs and contents.
                        preg_match('/^((\d+\s+\d+\s*)*)(.*)$/s', $content, $match);
                        $content = $match[3];

                        // Extract xrefs.
                        $xrefs = preg_split(
                            '/(\d+\s+\d+\s*)/s',
                            $match[1],
                            -1,
                          PREG_SPLIT_NO_EMPTY | PREG_SPLIT_DELIM_CAPTURE
                        );
                        $table = array();

                        foreach ($xrefs as $xref) {
                            list($id, $position) = explode(' ', trim($xref));
                            $table[$position] = $id;
                        }

                        ksort($table);

                        $ids       = array_values($table);
                        $positions = array_keys($table);

                        foreach ($positions as $index => $position) {
                            $id            = $ids[$index] . '_0';
                            $next_position = isset($positions[$index + 1]) ? $positions[$index + 1] : strlen($content);
                            $sub_content   = substr($content, $position, $next_position - $position);

                            $sub_header         = Header::parse($sub_content, $document);
                            $object             = PDFObject::factory($document, $sub_header, '');
                            $this->objects[$id] = $object;
                        }

                        // It is not necessary to store this content.
                        $content = '';

                        return;
                    }
                    break;

                default:
                    if ($part != 'null') {
                        $element = $this->parseHeaderElement($part[0], $part[1], $document);

                        if ($element) {
                            $header = new Header(array($element), $document);
                        }
                    }
                    break;

            }
        }

        if (!isset($this->objects[$id])) {
            $this->objects[$id] = PDFObject::factory($document, $header, $content);
        }
    }

    /**
     * @param array    $structure
     * @param Document $document
     *
     * @return Header
     * @throws \Exception
     */
    protected function parseHeader($structure, $document)
    {
        $elements = array();
        $count    = count($structure);

        for ($position = 0; $position < $count; $position += 2) {
            $name  = $structure[$position][1];
            $type  = $structure[$position + 1][0];
            $value = $structure[$position + 1][1];

            $elements[$name] = $this->parseHeaderElement($type, $value, $document);
        }

        return new Header($elements, $document);
    }

    /**
     * @param $type
     * @param $value
     * @param $document
     *
     * @return Element|Header
     * @throws \Exception
     */
    protected function parseHeaderElement($type, $value, $document)
    {
        switch ($type) {
            case '<<':
                return $this->parseHeader($value, $document);

            case 'numeric':
                return new ElementNumeric($value, $document);

            case 'boolean':
                return new ElementBoolean($value, $document);

            case 'null':
                return new ElementNull($value, $document);

            case '(':
                if ($date = ElementDate::parse('(' . $value . ')', $document)) {
                    return $date;
                } else {
                    return ElementString::parse('(' . $value . ')', $document);
                }

            case '<':
                return $this->parseHeaderElement('(', ElementHexa::decode($value, $document), $document);

            case '/':
                return ElementName::parse('/' . $value, $document);

            case 'ojbref': // old mistake in tcpdf parser
            case 'objref':
                return new ElementXRef($value, $document);

            case '[':
                $values = array();

                foreach ($value as $sub_element) {
                    $sub_type  = $sub_element[0];
                    $sub_value = $sub_element[1];
                    $values[]  = $this->parseHeaderElement($sub_type, $sub_value, $document);
                }

                return new ElementArray($values, $document);

            case 'endstream':
            case 'obj': //I don't know what it means but got my project fixed.
            case '':
                // Nothing to do with.
                break;

            default:
                throw new \Exception('Invalid type: "' . $type . '".');
        }
    }
}

The PDF is parsed when I launch cron.php manually, but not when it runs from crontab. I have been stuck for 4 days and cannot figure out where the problem is. Any advice would be appreciated. Thanks, Emil.

emil_alm
  • I'm far from a Linux/system expert, but I believe running the cron command manually and through a cron job makes only one potential difference: which user runs it. When you say you launch it manually, do you do it with the exact same command that's in the crontab? Everything looks quite normal from the rest. Is your error_reporting set to E_ALL? No additional errors when using `require` instead of `include`? – Jeto Jan 03 '20 at 03:05
  • On an unrelated side note, your composer dependencies should not be stored within your webserver's public directory. Neither should your cron scripts, btw (since they're to be executed from the command line executable). – Jeto Jan 03 '20 at 03:17
  • When I say I launch the file manually, I mean that I open the web page in Firefox. The user is then www-data, because it runs under apache2. With crontab the user is different: the task runs under the user emil. Both work for fetching the email, but if the email has a PDF attachment to be parsed, I get this error when the job is launched by the crontab user emil. – emil_alm Jan 03 '20 at 13:33
  • It seems to be a user problem. I will try creating the same cron job under the www-data user to check. – emil_alm Jan 03 '20 at 13:57
  • I'd say it's more likely related to the include path. Did you check the error_reporting level as I mentioned earlier? Also, are you certain `$_SERVER["DOCUMENT_ROOT"]` isn't used at any point in your scripts? That isn't defined when running them via CLI. – Jeto Jan 03 '20 at 14:03
  • Yes, I report all errors with `error_reporting(E_ALL); ini_set('display_errors', 1);` and the only path defined is `__DIR__ . '/att1'`, which I think is fine. – emil_alm Jan 03 '20 at 21:07
  • Hi Jeto, I found something that looks odd to me: all the files in the folder /src/Smalot/PdfParser/ have 664 permissions. That means they cannot be executed, only read and written by the owner or group. Is that normal, in your opinion? – emil_alm Jan 03 '20 at 21:17
  • The way how you run your stuff should be part of your question, go and [edit] the question to clarify stuff. For any further analysis, you'd have to extract and provide the exact steps required to reproduce the issue, i.e. a [mcve]. – Ulrich Eckhardt Jan 03 '20 at 22:52

1 Answer

OK, it works now. I found a way to get it working:

I added this line to cron.php: echo "getcwd=" . getcwd(); and saw that the current directory was wrong when the script ran from cron. So I moved the cron job to the root crontab so that the current directory is root, and then adjusted the paths to fit. Thanks, Jeto, for your support.
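
An alternative that should not depend on the current directory at all is to build the include path from `__DIR__`, which is the directory of the file being executed regardless of how the script is started. This is only a minimal sketch, not what I ended up running: `pathFromScriptDir` is a hypothetical helper name, and it assumes the `vendor` folder sits next to cron.php as in the question.

```php
<?php
// Hypothetical helper (not part of PdfParser): resolve a path relative to
// the directory of this script instead of the current working directory.
function pathFromScriptDir(string $relative): string
{
    return __DIR__ . '/' . ltrim($relative, '/');
}

// Under cron, getcwd() is usually the crontab owner's home directory, so a
// relative include such as 'vendor/autoload.php' is resolved from there.
// __DIR__ is the same whether the script runs via Apache or via cron.
$autoload = pathFromScriptDir('vendor/autoload.php');

if (is_file($autoload)) {
    require_once $autoload;
} else {
    // Log where we looked; `include` would only emit a warning and continue,
    // which is how the "Class not found" error shows up much later.
    error_log("autoload not found at $autoload (cwd=" . getcwd() . ")");
}
```

The hand-written autoload.php shown in the question has the same weakness, because it sets `$vendorDir = 'vendor';`. Since that file lives inside the vendor folder, `$vendorDir = __DIR__;` would probably be the equivalent fix there. Calling `chdir(__DIR__);` at the top of cron.php is another option if many relative paths are involved.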

emil_alm