
I want to build a web application that extracts the highlighted text from a PDF file. I have used FPDF and PDFlib for many purposes, but I have not found them helpful for this. How can I do it? Failing that, which PHP libraries or frameworks support it? I would also like to know whether there is an API I can use for this purpose. I would be very grateful for your help.

  • Maybe you could tell us _why_ `PDFlib` does not work for you? Their `PDI` product offers _exactly_ what you describe. – arkascha Dec 10 '15 at 20:07
  • If I can do that with PDFlib, I would like a link or anything else that illustrates how it is possible. Thank you very much for your reply. Maybe I don't understand PDFlib as well as you do. Please be a little more specific about how I can use PDFlib to extract only the highlighted text from PDF files. Thanks! – Waseem Abbas Dec 11 '15 at 12:23
  • As far as I can tell, PDFlib lets you search for particular text and then highlight it, which is exactly the opposite of what I want. I want to find the text that is already highlighted and then extract it. – Waseem Abbas Dec 11 '15 at 12:56
  • I referred to `PDI` by the PDFlib company, if you look at my comment. It allows you to take a PDF document completely apart and use all the bricks you get however you want. It is certainly able to solve your task, and it is a very mighty tool. However, it is also a pretty expensive thing. – arkascha Dec 11 '15 at 13:20
  • I've searched extensively on PDFlib+PDI. What PDI does here is extract the data that PDFlib itself highlighted: in PDFlib we look for a particular text and highlight it. PDI does not extract text by telling highlighted and non-highlighted text apart; it simply knows the terms PDFlib looked for, which is understandable. One more thing I noticed with PDFlib+TET is that it cannot distinguish between highlighted and non-highlighted text. It treats them the same way. You have my undying gratitude for the help so far. – Waseem Abbas Dec 12 '15 at 10:21

1 Answer


You can do this with the SetaPDF-Extractor component (a commercial product of ours!). It gives you access to the highlight annotations, from which you can build specific filters for the extraction process. A simple example script could look like this:

<?php
// load and register the autoload function
require_once('library/SetaPDF/Autoload.php');

// create a document instance
$document = SetaPDF_Core_Document::loadByFilename('path/to/the/highlighted.pdf');
// initiate an extractor instance
$extractor = new SetaPDF_Extractor($document);

// get the document's pages object
$pages = $document->getCatalog()->getPages();

// we are going to save the results in this variable
$results = array();

// iterate over all pages
for ($pageNo = 1, $pageCount = $pages->count(); $pageNo <= $pageCount; $pageNo++) {
    // get the page object
    $page = $pages->getPage($pageNo);
    // get the highlight annotations
    $annotations = $page->getAnnotations()->getAll(SetaPDF_Core_Document_Page_Annotation::TYPE_HIGHLIGHT);

    // create a strategy instance
    $strategy = new SetaPDF_Extractor_Strategy_Word();
    // create a multi filter instance
    $filter = new SetaPDF_Extractor_Filter_Multi();
    // and pass it to the strategy
    $strategy->setFilter($filter);

    // iterate over all highlight annotations
    foreach ($annotations AS $annotation) {
        /**
         * @var SetaPDF_Core_Document_Page_Annotation_Highlight $annotation
         */
        $name = $annotation->getName();

        // iterate over the quad points to setup our filter instances
        $quadpoints = $annotation->getQuadPoints();
        for ($pos = 0, $c = count($quadpoints); $pos < $c; $pos += 8) {
            $llx = min($quadpoints[$pos + 0], $quadpoints[$pos + 2], $quadpoints[$pos + 4], $quadpoints[$pos + 6]);
            $urx = max($quadpoints[$pos + 0], $quadpoints[$pos + 2], $quadpoints[$pos + 4], $quadpoints[$pos + 6]);
            $lly = min($quadpoints[$pos + 1], $quadpoints[$pos + 3], $quadpoints[$pos + 5], $quadpoints[$pos + 7]);
            $ury = max($quadpoints[$pos + 1], $quadpoints[$pos + 3], $quadpoints[$pos + 5], $quadpoints[$pos + 7]);

            // Add a new rectangle filter to the multi filter instance
            $filter->addFilter(
                new SetaPDF_Extractor_Filter_Rectangle(
                    new SetaPDF_Core_Geometry_Rectangle($llx, $lly, $urx, $ury),
                    SetaPDF_Extractor_Filter_Rectangle::MODE_CONTACT,
                    $name
                )
            );
        }
    }

    // if no filters are defined for this page, skip it
    if (0 === count($filter->getFilters())) {
        continue;
    }

    // pass the strategy to the extractor instance
    $extractor->setStrategy($strategy);
    // and get the results by the current page number
    $pageResult = $extractor->getResultByPageNumber($pageNo);

    // group the resulting words in a result array
    foreach ($pageResult AS $word) {
        $results[$pageNo][$word->getFilterId()][] = $word->getString();
    }
}

// debug output
echo '<pre>';
foreach ($results AS $pageNo => $annotationResults) {
    echo 'Page No #' . $pageNo . "\n";
    foreach ($annotationResults AS $name => $words) {
        echo '  Annotation name: ' . $name . "\n";
        echo '    Result: ' . join(' ', $words). "\n";
        echo '<br />';
    }
}
echo '</pre>';

The output is a simple dump of all words found for each highlight annotation, grouped by page number and annotation name.
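
The min/max step in the quad-point loop of the script can be tried on its own, without the library. The following is only a sketch: `quadPointsToRectangles()` is a hypothetical helper name, not part of SetaPDF, and the coordinates are made up. It assumes the quad points arrive as one flat array of numbers, eight per quadrilateral (four x/y pairs), as the loop in the script treats them.

```php
<?php
/**
 * Hypothetical helper (not part of SetaPDF): turns a flat list of quad
 * points into axis-aligned bounding boxes, one per group of 8 numbers.
 *
 * @param array $quadPoints Flat list: x1, y1, x2, y2, x3, y3, x4, y4, ...
 * @return array List of [llx, lly, urx, ury] arrays
 */
function quadPointsToRectangles(array $quadPoints): array
{
    $rectangles = [];

    // every 8 values describe the four corners of one highlighted region
    foreach (array_chunk($quadPoints, 8) as $quad) {
        // skip a trailing, incomplete group instead of reading missing keys
        if (count($quad) < 8) {
            continue;
        }

        $xs = [$quad[0], $quad[2], $quad[4], $quad[6]];
        $ys = [$quad[1], $quad[3], $quad[5], $quad[7]];

        // lower-left and upper-right corner of the bounding box
        $rectangles[] = [min($xs), min($ys), max($xs), max($ys)];
    }

    return $rectangles;
}

// made-up coordinates: one highlight that spans two lines of text
$rects = quadPointsToRectangles([
    72, 700, 300, 700, 72, 688, 300, 688,
    72, 686, 180, 686, 72, 674, 180, 674,
]);

print_r($rects);
```

Taking the minimum and maximum over all four corners makes the result independent of the order in which a PDF producer wrote the corners, which is presumably why the script does the same. Each resulting rectangle corresponds to the four values passed to `SetaPDF_Core_Geometry_Rectangle` in the script.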

Jan Slabon
  • OK. Do you provide an API, or will I have to buy your product? And are you sure it will help me extract the highlighted text from PDF files? I am only interested in getting the highlighted text out of PDF documents. Thank you very much for the advice. I'll certainly give it a try if it suits my purpose. – Waseem Abbas Dec 13 '15 at 19:44
  • You have to purchase a license for it, but you can test it with an [evaluation version](https://www.setasign.com/products/setapdf-extractor/evaluate/) first. The script above does exactly what you are looking for: it gives you access to the text/words that are marked by highlight annotations. – Jan Slabon Dec 13 '15 at 22:33