With libraries like iTextSharp or iText you can extract metadata from PDF documents via a PdfReader:

using (var reader = new PdfReader(pdfBytes))
{
    return reader.Metadata == null ? null : Encoding.UTF8.GetString(reader.Metadata);
}

These kinds of libraries parse the entire PDF document before they can extract the metadata. In my case this leads to high usage of system resources, since we receive many requests per second with large PDFs.

Is there a way to extract the metadata from the PDF without completely loading it in memory first?

Michel van Engelen
  • Is the problem related to IO or parsing and interpreting (CPU/IO)? – Steeeve Nov 08 '21 at 19:39
  • You should examine your design. You state you get many requests per second and one would assume that you are searching in the PDFs at that time. Why are you not indexing this information when the PDFs are created or stored? – Kevin Brown Nov 09 '21 at 04:03
  • @Steeeve it's mainly a memory issue. Lots of Gen 2 gc's and pauses. – Michel van Engelen Nov 09 '21 at 07:22
  • @KevinBrown it's keeping the boat afloat with fixes till the services are overhauled in .Net Core. PDF's come in from our customers, we do not create them ourselves. – Michel van Engelen Nov 09 '21 at 07:24
  • Use the `PdfReader` in partial mode. Then only some core objects of the PDF are parsed. – mkl Nov 09 '21 at 08:53

2 Answers

With PDF4NET you can extract the XMP metadata without loading the entire document in memory:

// This does a minimal parsing of the PDF file and loads 
// only a few objects from the file
PDFFile pdfFile = new PDFFile(new MemoryStream(pdfBytes));

string xmpMetadata = pdfFile.ExtractXmpMetadata();

Update 1: code changed to load the file from a byte array

Disclaimer: I work for the company that develops the PDF4NET library.

iPDFdev
  • Considering the name `pdfBytes` the OP seems to work with a PDF in-memory, not in a file system. – mkl Nov 09 '21 at 08:52

iText 5.x allows partial reading of PDFs, too; it merely looks a bit more complicated.

Instead of

using (var reader = new PdfReader(pdfBytes))

use

using (var reader = new PdfReader(new RandomAccessFileOrArray(pdfBytes), null, true))

where the final true requests partial reading.
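Combined with the snippet from the question, a minimal sketch of the whole extraction could look like the following. It assumes iTextSharp 5.x; the class name `PdfMetadataSketch` and the method name `ExtractXmp` are made up for illustration and are not part of the library.

```csharp
using System.Text;
using iTextSharp.text.pdf;

public static class PdfMetadataSketch
{
    // Hypothetical helper: returns the XMP packet as a string,
    // or null if the PDF carries no XMP metadata.
    public static string ExtractXmp(byte[] pdfBytes)
    {
        // The final 'true' requests partial reading, so only some
        // core objects are parsed instead of the whole document.
        using (var reader = new PdfReader(new RandomAccessFileOrArray(pdfBytes), null, true))
        {
            byte[] xmp = reader.Metadata;
            return xmp == null ? null : Encoding.UTF8.GetString(xmp);
        }
    }
}
```

The returned string is the raw XMP XML, so any individual property still has to be read out of it with an XML parser.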

mkl
  • I like this, it is already twice as fast. Is it also possible to get *custom* metadata from the pdf document instead of having to parse Encoding.UTF8.GetString(doc.GetXmpMetadata())? – Michel van Engelen Nov 09 '21 at 16:19
  • I'm not aware of such an option. In particular as with PDF 2.0 the XMP metadata have become the prime metadata source, one should look into these XMP metadata. – mkl Nov 09 '21 at 18:37