How can I extract highlighted text from PDF file using PHP?

Question

I want to build a web application that extracts the highlighted text from a PDF file. I've used FPDF and PDFlib for many purposes, but I haven't found them helpful for this. How can I do it, or which PHP libraries or frameworks support it? I would also like to know whether there is any API I can use for this purpose. I would be highly grateful for your help.

Maybe you could tell us _why_ `PDFlib` does not work for you? Their `PDI` product offers _exactly_ what you describe. — arkascha, Dec 10 '15 at 20:07
If I can do that with PDFlib, I would like a link or anything else that illustrates how it is possible. Thank you very much for your reply. Maybe I don't understand PDFlib as well as you do. Please be a little more specific about how I can use PDFlib to extract only the highlighted text from PDF files. Thanks! — Waseem Abbas, Dec 11 '15 at 12:23
As far as I can tell, PDFlib lets you search for particular text and then highlight it, which is exactly the opposite of what I want. I want to find the text that is already highlighted and then extract it. — Waseem Abbas, Dec 11 '15 at 12:56
I referred to `PDI` by the PDFlib company, if you look at my comment. It allows you to take a PDF document completely apart and use all the bricks you get however you want. It is certainly able to solve your task, and it is a very mighty tool. However, it is also a pretty expensive thing. — arkascha, Dec 11 '15 at 13:20
I've searched extensively on PDFlib+PDI. What PDI does here is extract all the data highlighted by PDFlib. In PDFlib we look for a particular text and highlight it. PDI does not extract the text by telling highlighted and non-highlighted text apart; it simply knows the terms PDFlib looked for, which is quite understandable. One more thing I experienced with PDFlib+TET is that it simply cannot distinguish between highlighted and non-highlighted text. It treats them the same way. You have my undying gratitude for the help so far. — Waseem Abbas, Dec 12 '15 at 10:21

score 0 · Answer 1 · answered Dec 13 '15 at 16:54

You can do this with the SetaPDF-Extractor component (a commercial product of ours!). It gives you access to the highlight annotations, from which you can create specific filters for the extraction process. A simple example script could look like this:

<?php
// load and register the autoload function
require_once('library/SetaPDF/Autoload.php');

// create a document instance
$document = SetaPDF_Core_Document::loadByFilename('path/to/the/highlighted.pdf');
// initiate an extractor instance
$extractor = new SetaPDF_Extractor($document);

// get the document's pages object
$pages = $document->getCatalog()->getPages();

// we are going to save the results in this variable
$results = array();

// iterate over all pages
for ($pageNo = 1, $pageCount = $pages->count(); $pageNo <= $pageCount; $pageNo++) {
    // get the page object
    $page = $pages->getPage($pageNo);
    // get the highlight annotations
    $annotations = $page->getAnnotations()->getAll(SetaPDF_Core_Document_Page_Annotation::TYPE_HIGHLIGHT);

    // create a strategy instance
    $strategy = new SetaPDF_Extractor_Strategy_Word();
    // create a multi filter instance
    $filter = new SetaPDF_Extractor_Filter_Multi();
    // and pass it to the strategy
    $strategy->setFilter($filter);

    // iterate over all highlight annotations
    foreach ($annotations AS $annotation) {
        /**
         * @var SetaPDF_Core_Document_Page_Annotation_Highlight $annotation
         */
        $name = $annotation->getName();

        // iterate over the quad points to setup our filter instances
        $quadpoints = $annotation->getQuadPoints();
        for ($pos = 0, $c = count($quadpoints); $pos < $c; $pos += 8) {
            $llx = min($quadpoints[$pos + 0], $quadpoints[$pos + 2], $quadpoints[$pos + 4], $quadpoints[$pos + 6]);
            $urx = max($quadpoints[$pos + 0], $quadpoints[$pos + 2], $quadpoints[$pos + 4], $quadpoints[$pos + 6]);
            $lly = min($quadpoints[$pos + 1], $quadpoints[$pos + 3], $quadpoints[$pos + 5], $quadpoints[$pos + 7]);
            $ury = max($quadpoints[$pos + 1], $quadpoints[$pos + 3], $quadpoints[$pos + 5], $quadpoints[$pos + 7]);

            // Add a new rectangle filter to the multi filter instance
            $filter->addFilter(
                new SetaPDF_Extractor_Filter_Rectangle(
                    new SetaPDF_Core_Geometry_Rectangle($llx, $lly, $urx, $ury),
                    SetaPDF_Extractor_Filter_Rectangle::MODE_CONTACT,
                    $name
                )
            );
        }
    }

    // if no filters are defined for this page, skip it
    if (0 === count($filter->getFilters())) {
        continue;
    }

    // pass the strategy to the extractor instance
    $extractor->setStrategy($strategy);
    // and get the results by the current page number
    $pageResult = $extractor->getResultByPageNumber($pageNo);

    // group the resulting words in a result array
    foreach ($pageResult AS $word) {
        $results[$pageNo][$word->getFilterId()][] = $word->getString();
    }
}

// debug output
echo '<pre>';
foreach ($results AS $pageNo => $annotationResults) {
    echo 'Page No #' . $pageNo . "\n";
    foreach ($annotationResults AS $name => $words) {
        echo '  Annotation name: ' . $name . "\n";
        echo '    Result: ' . join(' ', $words). "\n";
        echo '<br />';
    }
}
echo '</pre>';

The output is a simple dump of all found words for each highlight annotation.
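The quad-point loop in the script can also be looked at in isolation. The following is a minimal sketch in plain PHP, without SetaPDF, of how a flat quad-points array (eight numbers per quadrilateral) reduces to one bounding box per highlighted line. The function name `quadPointsToBoxes` and the sample coordinates are made up for illustration and are not part of any library.

```php
<?php
// Hypothetical helper (not part of SetaPDF): turn a flat quad-points array
// into one axis-aligned bounding box per quadrilateral.
function quadPointsToBoxes(array $quadPoints): array
{
    $boxes = [];
    // each quadrilateral is 8 numbers: x1 y1 x2 y2 x3 y3 x4 y4
    foreach (array_chunk($quadPoints, 8) as $quad) {
        if (count($quad) < 8) {
            continue; // ignore an incomplete trailing quadrilateral
        }
        $xs = [$quad[0], $quad[2], $quad[4], $quad[6]];
        $ys = [$quad[1], $quad[3], $quad[5], $quad[7]];
        $boxes[] = [
            'llx' => min($xs), // lower-left x
            'lly' => min($ys), // lower-left y
            'urx' => max($xs), // upper-right x
            'ury' => max($ys), // upper-right y
        ];
    }
    return $boxes;
}

// made-up coordinates for a highlight that spans two lines of text
$boxes = quadPointsToBoxes([
    72, 710, 200, 710, 72, 698, 200, 698,
    72, 696, 150, 696, 72, 684, 150, 684,
]);
print_r($boxes);
```

Taking the minimum and maximum of the four corners does not depend on the order in which a PDF producer writes them, which is presumably why the script above does the same instead of reading fixed corner positions.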

OK. Do you provide any API, or will I have to buy your product? And are you sure that it will help me extract the highlighted text from PDF files? I am only interested in getting the highlighted text out of PDF documents. Thank you very much for the advice. I'll certainly give it a try if I find it suits my purpose. — Waseem Abbas, Dec 13 '15 at 19:44
You have to purchase a license for it. But of course you can test it with an [evaluation version](https://www.setasign.com/products/setapdf-extractor/evaluate/) first. The script above does exactly what you are looking for. It allows you to access the text/words that are marked by highlight annotations. — Jan Slabon, Dec 13 '15 at 22:33
