I have a geospatial PDF exported by QGIS with a raster inside. I need to read this raster to create an XYZ tile structure, and for that I need the file's coordinates/extent, but I can't extract this information from the PDF or from the raster itself. I have already tried converting the PDF to text and reading it, and I also tried extracting the raster from the PDF to read the image, but neither approach worked. I haven't found anything explaining where this information is stored in the file or how to read it.
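For context on why the extent matters, here is a minimal sketch of how a known extent would map to a range of XYZ tiles at one zoom level. It assumes the extent is available in WGS84 degrees and uses the standard slippy-map (Web Mercator) tile formula; the function names and the shape of the `extent` object are hypothetical and only for illustration.

```javascript
// Web Mercator tiles are only defined up to roughly +/-85.0511 degrees of latitude.
var MAX_LAT = 85.05112878;

// Convert a WGS84 longitude/latitude to XYZ tile indices at a zoom level.
function lonLatToTile(lon, lat, zoom) {
  var n = Math.pow(2, zoom); // number of tiles along each axis
  var clampedLat = Math.min(Math.max(lat, -MAX_LAT), MAX_LAT);
  var latRad = clampedLat * Math.PI / 180;
  var x = Math.floor((lon + 180) / 360 * n);
  var y = Math.floor(
    (1 - Math.log(Math.tan(latRad) + 1 / Math.cos(latRad)) / Math.PI) / 2 * n
  );
  // Keep indices inside the valid range (lon = 180 would otherwise give x = n).
  x = Math.min(Math.max(x, 0), n - 1);
  y = Math.min(Math.max(y, 0), n - 1);
  return { x: x, y: y };
}

// extent = { west, south, east, north } in degrees (hypothetical shape).
// Tile rows grow southwards, so the north edge gives the smallest y.
function tileRangeForExtent(extent, zoom) {
  var topLeft = lonLatToTile(extent.west, extent.north, zoom);
  var bottomRight = lonLatToTile(extent.east, extent.south, zoom);
  return {
    zoom: zoom,
    minX: topLeft.x,
    maxX: bottomRight.x,
    minY: topLeft.y,
    maxY: bottomRight.y
  };
}
```

Calling `tileRangeForExtent(extent, zoom)` for each zoom level would give the tile rows and columns the raster has to be cut into, which is why nothing can be generated until the extent is known.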
What I tried:
With the PDF: I tried to read the PDF metadata using PDF.js.
PDFJS.getDocument(url).then(function (pdfDoc_) {
    pdfDoc = pdfDoc_;
    pdfDoc.getMetadata().then(function (stuff) {
        console.log(stuff); // Metadata object here
    }).catch(function (err) {
        console.log('Error getting meta data');
        console.log(err);
    });
    // Render the first page or whatever here
    // More code . . .
}).catch(function (err) {
    console.log('Error getting PDF from ' + url);
    console.log(err);
});
I also tried a few websites that read PDF metadata (Website 1, Website 2, for example). Still with PDF.js, I tried converting the PDF to text to see if I could identify anything.
var PDF_URL = '/path/to/example.pdf';
PDFJS.getDocument(PDF_URL).then(function (PDFDocumentInstance) {
    var totalPages = PDFDocumentInstance.numPages;
    var pageNumber = 1;
    // Extract the text
    getPageText(pageNumber, PDFDocumentInstance).then(function (textPage) {
        // Show the text of the page in the console
        console.log(textPage);
    });
}, function (reason) {
    // PDF loading error
    console.error(reason);
});
/**
 * Retrieves the text of a specific page within a PDF document obtained through pdf.js
 *
 * @param {Integer} pageNum Specifies the number of the page
 * @param {PDFDocument} PDFDocumentInstance The PDF document obtained
 **/
function getPageText(pageNum, PDFDocumentInstance) {
    // Return the promise chain directly so that errors from getPage or
    // getTextContent reach the caller instead of being swallowed
    return PDFDocumentInstance.getPage(pageNum).then(function (pdfPage) {
        // The main trick to obtain the text of the PDF page: use the getTextContent method
        return pdfPage.getTextContent();
    }).then(function (textContent) {
        var textItems = textContent.items;
        var finalString = "";
        // Concatenate the string of each item to the final string
        for (var i = 0; i < textItems.length; i++) {
            finalString += textItems[i].str + " ";
        }
        // The promise resolves with the text retrieved from the page
        return finalString;
    });
}
Sadly, I didn't get any useful information. I tried the same thing with some other sites (Website 1, Website 2, Website 3), but the results were still empty.
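One thing that could still be checked from JavaScript is the raw file content rather than the text layer or the document metadata. In the ISO 32000 geospatial extension, the georeferencing is stored in a `/Measure` dictionary whose `/GPTS` array lists control points as latitude/longitude pairs in degrees. The sketch below is untested against a real QGIS export and rests on assumptions: that the export used the ISO 32000 encoding, that the dictionary is written uncompressed (not inside a compressed object stream), and that the file has already been read into a string. The helper names are hypothetical.

```javascript
// Hypothetical helper: find every /GPTS array in the raw PDF text and
// return its numbers as { lat, lon } points (ASSUMES latitude comes first).
function extractGeoPoints(pdfText) {
  var results = [];
  var pattern = /\/GPTS\s*\[([^\]]*)\]/g;
  var match;
  while ((match = pattern.exec(pdfText)) !== null) {
    var numbers = match[1].trim().split(/\s+/).map(Number);
    // Skip empty or non-numeric arrays (e.g. indirect references).
    if (numbers.length < 2 || numbers.some(isNaN)) {
      continue;
    }
    var points = [];
    for (var i = 0; i + 1 < numbers.length; i += 2) {
      points.push({ lat: numbers[i], lon: numbers[i + 1] });
    }
    results.push(points);
  }
  return results;
}

// Hypothetical helper: bounding box of one set of control points.
// This is only the axis-aligned envelope; a rotated map would need more care.
function extentOfPoints(points) {
  var lats = points.map(function (p) { return p.lat; });
  var lons = points.map(function (p) { return p.lon; });
  return {
    west: Math.min.apply(null, lons),
    south: Math.min.apply(null, lats),
    east: Math.max.apply(null, lons),
    north: Math.max.apply(null, lats)
  };
}
```

In a browser, the string could be obtained by fetching the file as an `ArrayBuffer` and decoding it with `new TextDecoder('latin1')`, so that binary streams do not break the decoding. If `extractGeoPoints` returns an empty array, the assumptions above probably do not hold for the file.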
With the image: I tried to extract the raster from the PDF using JS code that I found on CodePen, and then read the image metadata with Windows. Only later did I realize that this code generated an SVG image, so the metadata was lost. I then used some websites that extract images from a PDF file (Website 1, Website 2) and checked the metadata of the resulting images with Windows again, but without success. Finally, I used some websites to view the metadata of the extracted images (Website 1, Website 2), but they didn't show the information I needed either (coordinates/extent).
I went back to researching to see if I could find code in another language or even in QGIS. While searching, I found GDAL documentation about reading PDFs through libraries such as PDFium, so I installed GDAL through OSGeo4W and tried to run some Python code to see this information:
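from osgeo import gdal, ogr

# Open the PDF as a vector datasource (None if it cannot be opened and exceptions are not enabled)
dso = ogr.Open('export.pdf')
print(dso)
# Open the PDF as a raster dataset (None if it cannot be opened and exceptions are not enabled)
dsg = gdal.Open('export.pdf')
print(dsg)
# Open it again, explicitly through the OGR PDF driver
driverGeoPDFogr = ogr.GetDriverByName('PDF')
dso2 = driverGeoPDFogr.Open('export.pdf')
print(dso2)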
I tried to run it in the OSGeo4W Shell, but I had problems with the Python environment, more specifically an error when importing modules.
Error:
ImportError: Module use of python39.dll conflicts with this version of Python.
Running this code wasn't essential to what I wanted to do, though, so I decided to set it aside for now.
Then I tried to look at the geospatial PDF import/export code in the QGIS and GDAL repositories, but I couldn't find it.
I need to know if there is a way to get the coordinates/extent of a geospatial PDF using JavaScript.