How can I obtain the coordinates/extent from a geospatial PDF?

Question

I have a geospatial PDF exported from QGIS with a raster inside. I need to read this raster to build an XYZ tile structure, and for that I need the file's coordinates/extent, but I can't extract this information from the PDF or from the raster itself. I already tried converting the PDF to text and reading it, and I also tried extracting the raster from the PDF to read the image, but neither approach worked. I couldn't find anything describing where this information is stored in the file or how to read it.

What I tried:

With the PDF: I tried to read the PDF metadata using PDF.js.

PDFJS.getDocument(url).then(function (pdfDoc_) {
      pdfDoc = pdfDoc_;   
      pdfDoc.getMetadata().then(function(stuff) {
          console.log(stuff); // Metadata object here
      }).catch(function(err) {
         console.log('Error getting meta data');
         console.log(err);
      });

     // Render the first page or whatever here
     // More code . . . 
}).catch(function(err) {
     console.log('Error getting PDF from ' + url);
     console.log(err);
});

Source code

I also tried some websites that read PDF metadata (Website 1, Website 2, for example). Still using PDF.js, I tried converting the PDF to text to see if I could identify anything.

var PDF_URL  = '/path/to/example.pdf';

PDFJS.getDocument(PDF_URL).then(function (PDFDocumentInstance) {
    
    var totalPages = PDFDocumentInstance.numPages;
    var pageNumber = 1;

    // Extract the text
    getPageText(pageNumber , PDFDocumentInstance).then(function(textPage){
        // Show the text of the page in the console
        console.log(textPage);
    });

}, function (reason) {
    // PDF loading error
    console.error(reason);
});

/**
 * Retrieves the text of a specific page within a PDF document obtained through pdf.js
 *
 * @param {Integer} pageNum Specifies the number of the page
 * @param {PDFDocument} PDFDocumentInstance The PDF document obtained
 **/
function getPageText(pageNum, PDFDocumentInstance) {
    // Return the promise chain directly so that errors from getPage or
    // getTextContent reach the caller instead of being swallowed
    return PDFDocumentInstance.getPage(pageNum).then(function (pdfPage) {
        // The main trick to obtain the text of the PDF page: use the getTextContent method
        return pdfPage.getTextContent();
    }).then(function (textContent) {
        var textItems = textContent.items;
        var finalString = "";

        // Concatenate the string of each item to the final string
        for (var i = 0; i < textItems.length; i++) {
            finalString += textItems[i].str + " ";
        }

        // Resolve with the text retrieved from the page
        return finalString;
    });
}

Source Code

Sadly, I didn't get any information. I tried the same thing through some other sites (Website 1, Website 2, Website 3), but the results were still empty.

With the image: I tried to extract the raster from the PDF using some JS code I found on CodePen, and then tried to read the image metadata in Windows. Only later did I realize that this code generated an SVG image, so the metadata was lost. I then used some websites that extract images from PDF files (Website 1, Website 2) and checked the metadata of the images in Windows again, without success. Finally, I used some websites to view the metadata of the extracted images (Website 1, Website 2), but none of them showed the information I needed (coordinates/extent).

I went back to researching to see if I could find code in another language or even in QGIS. While searching I found GDAL documentation about reading PDFs through libraries such as PDFium, so I installed GDAL through OSGeo4W and tried to run some Python code to see this information:

from osgeo import ogr  
from osgeo import gdal

dso = ogr.Open('export.pdf')
print(dso)

dsg = gdal.Open('export.pdf')
print(dsg)

driverGeoPDFogr = ogr.GetDriverByName('PDF')
dso2 = driverGeoPDFogr.Open('export.pdf')
print(dso2)

I tried to run it in the OSGeo4W shell, but I had problems with the Python environment, specifically an error when importing modules.

Error:

ImportError: Module use of python39.dll conflicts with this version of Python.

Running this code wasn't essential for what I wanted to do, so I decided to leave it aside for now.

Then I tried to look at the geospatial PDF import/export code in the QGIS and GDAL repositories, but I couldn't find it.

I need to know if there is a way to get the coordinates/extent of a geospatial PDF using JavaScript.

I assume that the pdf is valid as I exported it directly from QGIS. — Gabi Vieira, May 17 '22 at 16:18
When I imported the file again, the coordinates, extent and projection were correct, as expected. — Gabi Vieira, May 17 '22 at 16:35
I tried to do this with my geospatial PDF, but in every case, whether with the PDF.js code or with the sites I mentioned, the result was always the same: an empty string. If you want to check it out, my PDF is here: https://wetransfer.com/downloads/69489430bb5f90f83452673027d8a9d620220517173905/706b59 — Gabi Vieira, May 17 '22 at 17:45
I didn't even think about opening the file in Notepad. This solves my problem because I can now read the file as text through JavaScript. Thank you very much, my friend. If you want to answer the question, I will mark it as accepted. — Gabi Vieira, May 17 '22 at 19:54
The file is in poor quality and low resolution on purpose, I did this while exporting to make testing it easier. — Gabi Vieira, May 17 '22 at 19:58

Zach Young · Answer 1 · 2022-05-18T21:05:15.807

I found a way to do this with PDF.js.

As @K_J pointed out, there are dictionary items in the PDF that relate to geospatial features.

Adobe added the Geospatial Features specification and defined a "geospatial measures dictionary" that can be included in the PDF. If it's included it must contain the GPTS key, which defines the extent of the geographical space in latitude and longitude^1:

GPTS (array) (Required; ExtensionLevel 3) An array of numbers taken pairwise, defining points in geographic space as degrees of latitude and longitude...

There's also the WKT key, which K_J pointed out:

WKT (ASCII string) (Optional; ExtensionLevel 3) A string of Well Known Text describing the geographic coordinate system.
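Since the asker reports in the comments that the file can be read as plain text, here is a minimal sketch of pulling GPTS out of the raw text with a regular expression. It assumes the measure dictionary is stored uncompressed and that GPTS is written as an inline array rather than as an indirect reference; the function name extractGptsArrays and the sample string are hypothetical, not part of PDF.js or of the PDF in question.

```javascript
// Sketch: scan raw PDF text for inline, uncompressed /GPTS arrays.
// Assumption: the dictionary is not inside a compressed object stream
// and GPTS is not an indirect reference (e.g. "/GPTS 12 0 R").
function extractGptsArrays(pdfText) {
  const results = [];
  // "/GPTS", optional whitespace, then everything up to the closing bracket.
  const pattern = /\/GPTS\s*\[([^\]]*)\]/g;
  let match;
  while ((match = pattern.exec(pdfText)) !== null) {
    const numbers = match[1]
      .trim()
      .split(/\s+/)
      .filter((token) => token.length > 0)
      .map(Number);
    results.push(numbers);
  }
  return results;
}

// Hypothetical excerpt of a measure dictionary, for illustration only.
const sampleText = '<< /Type /Measure /GPTS [ 10.5 20.25 10.5 21 ] >>';
console.log(extractGptsArrays(sampleText));
```

How the file's bytes are turned into a string (for example by decoding them as Latin-1) is left out here. If the scan returns nothing, the dictionary is probably compressed, and walking the object tree is the more robust route.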

This is the "Adobe way" of doing it; there's also the OGC way^2:

The georeferencing metadata for geospatial PDF is most commonly encoded in one of two ways: the OGC best practice; and as Adobe's proposed geospatial extensions to ISO 32000.

Though I cannot find the OGC definition.

But your PDF seems to be using the Adobe way. So, how to get that dictionary key and its values with PDF.js...

From the PDF.js documentation I found a reference to this project^3 which is a pretty simple "PDF object browser". I forked it, and made it non-interactive so it walks the whole tree and logs a path through the tree if it finds the GPTS key.

Here's a snippet from my version of browser.js:

...
const MAX_DEPTH = 10;
...
function walk(node, callDepth, nodePath) {
    // Not sure about this, but I think I'm directing the walker to completely resolve referenced nodes
    while (isRef(node.obj)) {
        var fetched = xref.fetch(node.obj);
        node = new Node(fetched, node.name, node.depth, node.obj);
    }

    nodePath += ' '.repeat(node.depth) + ' - ' + toText(node) + '\n';

    if (node.name === 'GPTS') {
        console.log(nodePath);
        printCoords(node);
        return;
    }

    if (callDepth > MAX_DEPTH) {
        return;
    }

    for (const childNode of node.children) {
        walk(childNode, callDepth + 1, nodePath);
    }
}

function printCoords(gPTSNode) {
    for (const childNode of gPTSNode.children) {
        var path = ' '.repeat(childNode.depth) + ' - ' + toText(childNode);
        console.log(path);
    }
}

When I launch index.html from that project, and open your sample PDF, I get the following in the console:

- Trailer (dict)
 - Root (dict) [id: 2, gen: 0]
  - Pages (dict) [id: 1, gen: 0]
   - Kids (array)
    - 0 (dict) [id: 8, gen: 0]
     - VP (array)
      - 0 (dict) [id: 5, gen: 0]
       - Measure (dict) [id: 6, gen: 0]
        - GPTS (array)
          - 0 = 6965524.305664567
          - 1 = 582854.0718590557
          - 2 = 6965524.305664567
          - 3 = 585458.7618590547
          - 4 = 6963682.605664568
          - 5 = 582854.0718590557
          - 6 = 6963682.605664568
          - 7 = 585458.7618590547

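The children of the GPTS array are the coordinates of the extent, taken pairwise, so the eight values describe four corner points. In this file they are clearly not degrees of latitude and longitude, so they appear to be expressed in your map's own (projected) coordinate system.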
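To go from those eight numbers to an extent, a small helper along these lines may be enough. This is only a sketch: it assumes the values are ordered pairwise with the latitude-like (northing) value first and the longitude-like (easting) value second, following the GPTS description quoted above, and the name gptsToExtent is made up for illustration.

```javascript
// Sketch: turn a flat GPTS array into a bounding box.
// Assumption: each pair is (y, x), i.e. latitude/northing first and
// longitude/easting second. Swap the two if your file differs.
function gptsToExtent(gpts) {
  if (gpts.length < 2 || gpts.length % 2 !== 0) {
    throw new Error('GPTS must contain an even, non-zero number of values');
  }
  const xs = [];
  const ys = [];
  for (let i = 0; i < gpts.length; i += 2) {
    ys.push(gpts[i]);
    xs.push(gpts[i + 1]);
  }
  return {
    minX: Math.min(...xs),
    minY: Math.min(...ys),
    maxX: Math.max(...xs),
    maxY: Math.max(...ys),
  };
}

// The values logged for the sample PDF.
const sampleGpts = [
  6965524.305664567, 582854.0718590557,
  6965524.305664567, 585458.7618590547,
  6963682.605664568, 582854.0718590557,
  6963682.605664568, 585458.7618590547,
];
console.log(gptsToExtent(sampleGpts));
```

Because these values are not in degrees, they would still need to be reprojected (using the coordinate system named in the WKT key) before being used for XYZ tiles.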

You can play around with the MAX_DEPTH var and see how many (indirect?) references there are to this dictionary. The higher the depth threshold, the more references you'll find buried in the tree.
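One plausible reason for the extra hits is that PDF page trees contain back-references (a page points to its parent), so a deeper walk reaches the same dictionary again through a longer path. The toy model below illustrates that effect with plain objects; it is not PDF.js code, and the { ref: id } convention, the objectsById table and the name findKeyPaths are all invented for this sketch.

```javascript
// Toy model: plain objects stand in for PDF dictionaries, and { ref: id }
// stands in for an indirect reference. None of this is PDF.js API.
function findKeyPaths(root, objectsById, targetKey, maxDepth) {
  const found = [];

  // Follow reference chains until a concrete value is reached.
  function resolve(value) {
    while (value !== null && typeof value === 'object' && 'ref' in value) {
      value = objectsById[value.ref];
    }
    return value;
  }

  function walk(value, path, depth) {
    value = resolve(value);
    if (depth > maxDepth || value === null || typeof value !== 'object') {
      return;
    }
    for (const [key, child] of Object.entries(value)) {
      const childPath = path.concat(key);
      if (key === targetKey) {
        found.push({ path: childPath.join(' > '), value: resolve(child) });
        continue;
      }
      walk(child, childPath, depth + 1);
    }
  }

  walk(root, [], 0);
  return found;
}

// A tiny tree with a Parent back-reference, loosely shaped like the log above.
const objectsById = {
  1: { Kids: [{ ref: 8 }] },
  8: { Parent: { ref: 1 }, VP: [{ Measure: { ref: 6 } }] },
  6: { GPTS: [1, 2, 3, 4] },
};
const trailer = { Root: { Pages: { ref: 1 } } };

for (const depth of [6, 7, 10]) {
  const hits = findKeyPaths(trailer, objectsById, 'GPTS', depth);
  console.log('maxDepth', depth, 'hits', hits.map((hit) => hit.path));
}
```

The depth limit is also what stops the walk from looping forever through the Parent cycle, which is a reason to keep it even when only the first hit is needed.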

This is my first time investigating the tree and objects, and I'm happy to see that the path correlates 1:1 with what the object viewer in Acrobat shows.

score 0 · Answer 2 · answered Jun 22 '22 at 13:18

Perhaps I have misunderstood the question, but you can use gdal2tiles.py to create XYZ tile structures from rasters. You only need GDAL installed for that: https://gdal.org/programs/gdal2tiles.html

sobmortin
