tl;dr: How can I convert a folder of PDFs into a list of CMYK values (or RGB, or any other colour-scale values), preferably in Python?

I have a folder with around 100,000 documents in it. To make sampling these documents easier, I want to run data analysis on them (clustering and anomaly detection), and one metric I want is the CMYK coverage. Is there a method or package, preferably in Python, that will calculate the CMYK coverage of a PDF?

Edit:

After some research I have found that Ghostscript should provide the functionality I require. If anyone could help me with the implementation, I would still really appreciate it.

The Lemon
  • The inkcov device will calculate the CMYK coverage of each page of a document. I'm afraid I don't understand what you actually want, though: you seem to want the CMYK coverage per document (rather than per page), which doesn't seem useful to me. I suppose you could total the coverage per page and divide by the number of pages to get an average. – KenS Jun 03 '18 at 03:20
  • Per-page CMYK would work fine; I'm just struggling to get even that, unfortunately – The Lemon Jun 03 '18 at 03:41

1 Answer

./gs -sDEVICE=inkcov -sOutputFile=out.txt input.pdf should write each page's CMYK coverage to out.txt.

You could use -dQUIET -o - instead of -sOutputFile to send the output to stdout (-o also implies -dBATCH and -dNOPAUSE, so Ghostscript exits once the file has been processed).
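Since the question asks for Python, here is a minimal sketch of calling the same command from a script and turning the output into numbers. It assumes that inkcov prints one line per page consisting of four fractions followed by "CMYK OK", and that the Ghostscript executable is reachable as gs (on Windows you would pass gswin64c instead). The names parse_inkcov and page_coverage are made up for this example.

```python
import re
import subprocess

# Assumed inkcov line format: four fractions (C, M, Y, K) followed by "CMYK OK".
INKCOV_LINE = re.compile(
    r"^\s*([\d.]+)\s+([\d.]+)\s+([\d.]+)\s+([\d.]+)\s+CMYK OK\s*$"
)


def parse_inkcov(text):
    """Return one (c, m, y, k) tuple of floats per matching line of inkcov output."""
    pages = []
    for line in text.splitlines():
        match = INKCOV_LINE.match(line)
        if match:  # ignore anything that is not a coverage line
            pages.append(tuple(float(v) for v in match.groups()))
    return pages


def page_coverage(pdf_path, gs="gs"):
    """Run Ghostscript's inkcov device on one PDF and parse what it prints to stdout."""
    result = subprocess.run(
        [gs, "-dQUIET", "-sDEVICE=inkcov", "-o", "-", str(pdf_path)],
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError if Ghostscript fails
    )
    return parse_inkcov(result.stdout)
```

Calling page_coverage("input.pdf") should then give a list with one tuple per page, provided the output format matches the assumption above.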

You then need some batch scripting, which will depend on your operating system. On Windows, something like:

for %s in (folder\*.pdf) do gswin64c -dQUIET -sDEVICE=inkcov -o - "%s" >> coverage.txt

ought to take every PDF in the folder, run it through the inkcov device and send the output to stdout. The >> redirection appends each file's output to coverage.txt instead of overwriting the previous output. (Inside a batch file, write %%s rather than %s.)

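Because >> always appends, you will need to delete coverage.txt before each new run of the loop, of course; otherwise the results from earlier runs stay in the file.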
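If you would rather do the batching in Python than in a shell loop, something along these lines might work. It is only a sketch: it assumes the same "CMYK OK" line format for inkcov output, the function names (run_inkcov, mean_coverage, folder_to_csv) and the CSV column layout are invented for illustration, and the per-document figure is the simple per-page average suggested in the comments.

```python
import csv
import subprocess
from pathlib import Path


def run_inkcov(pdf_path, gs="gs"):
    """Return one (c, m, y, k) tuple per page; assumes lines of the form '<c> <m> <y> <k> CMYK OK'."""
    out = subprocess.run(
        [gs, "-dQUIET", "-sDEVICE=inkcov", "-o", "-", str(pdf_path)],
        capture_output=True, text=True, check=True,
    ).stdout
    pages = []
    for line in out.splitlines():
        parts = line.split()
        if len(parts) >= 6 and parts[4:6] == ["CMYK", "OK"]:
            pages.append(tuple(float(p) for p in parts[:4]))
    return pages


def mean_coverage(pages):
    """Average the per-page tuples channel by channel; None for an empty document."""
    if not pages:
        return None
    n = len(pages)
    return tuple(sum(page[i] for page in pages) / n for i in range(4))


def folder_to_csv(folder, csv_path, gs="gs"):
    """Write one row per PDF in the folder: file name, page count, mean C, M, Y, K."""
    # Mode "w" starts a fresh file on every run, so nothing has to be deleted first.
    with open(csv_path, "w", newline="") as fh:
        writer = csv.writer(fh)
        writer.writerow(["file", "pages", "c", "m", "y", "k"])
        for pdf in sorted(Path(folder).glob("*.pdf")):
            try:
                pages = run_inkcov(pdf, gs)
            except subprocess.CalledProcessError:
                continue  # skip files Ghostscript cannot process
            avg = mean_coverage(pages)
            if avg is not None:
                writer.writerow([pdf.name, len(pages), *avg])
```

For example, folder_to_csv("folder", "coverage.csv", gs="gswin64c") would be the Windows equivalent of the loop above. With around 100,000 files this runs one Ghostscript process per document, so expect it to take a while.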

KenS