Convert a folder of PDFs into a csv of CMYK values

Question

TL;DR: How can I convert a folder of PDFs into a list of CMYK values (or RGB, or any other kind of colour values), preferably in Python?

I have a folder with roughly 100,000 documents in it. To make sampling these documents easier, I want to run some data analysis on them (clustering and anomaly detection), and one metric I want is the CMYK coverage. Is there any method or package, preferably in Python, that will calculate the CMYK coverage of a PDF?

Edit:

After some research I have found that Ghostscript should provide the functionality I require. If anyone could help me with the implementation, I would still really appreciate it.

The inkcov device will calculate the CMYK coverage of each page of a document. I'm afraid I don't understand what you actually want, though. You seem to want the CMYK coverage per document (rather than per page), which doesn't seem useful to me. I suppose you could total the per-page coverage and divide by the number of pages to get an average. - KenS, Jun 03 '18 at 03:20
Per-page CMYK would work fine; I'm just struggling to get even that, unfortunately. - The Lemon, Jun 03 '18 at 03:41

KenS · Accepted Answer · 2018-06-03T14:30:05.833

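./gs -dBATCH -dNOPAUSE -sDEVICE=inkcov -sOutputFile=out.txt input.pdf should write the CMYK coverage of each page to out.txt. (-dBATCH and -dNOPAUSE stop Ghostscript from pausing after each page and from dropping to its interactive prompt when it finishes.)

You could use -dQUIET -o - instead of -sOutputFile to send the output to stdout; -o also sets -dBATCH and -dNOPAUSE for you.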

You then need some batch scripting which will depend on your Operating System. On Windows something like:

for %s in (folder\*.pdf) do gswin64c -dQUIET -sDEVICE=inkcov -o - "%s" >> coverage.txt

ought to take every PDF file in the folder, run it through the inkcov device and send the output to stdout. We redirect stdout to a file with >> so that each execution appends to the file instead of overwriting the previous output. (That is the form for typing at the command prompt; inside a batch file the loop variable must be written %%s.)

Because >> appends, you will of course need to delete coverage.txt before each new run of the loop; otherwise the new results are added to the old ones.
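Since the question asked for Python, here is a minimal sketch of the same loop that writes a CSV directly, with the file name and page number on each row. It rests on a few assumptions: that a Ghostscript executable is reachable as gs (pass gs="gswin64c" on Windows), and that with -dQUIET the inkcov device prints one line per page of the form "0.10022  0.09563  0.10071  0.06259 CMYK OK". The names parse_inkcov, pdf_coverage and folder_to_csv are made up for this example and are not part of Ghostscript.

```python
import csv
import re
import subprocess
from pathlib import Path

# Assumed shape of one inkcov output line (one line per page):
#  0.10022  0.09563  0.10071  0.06259 CMYK OK
INKCOV_LINE = re.compile(
    r"^\s*([\d.]+)\s+([\d.]+)\s+([\d.]+)\s+([\d.]+)\s+CMYK\s+OK\s*$"
)


def parse_inkcov(text):
    """Return one (c, m, y, k) tuple of floats per matching line of inkcov output."""
    pages = []
    for line in text.splitlines():
        match = INKCOV_LINE.match(line)
        if match:  # anything else (e.g. "Page 1" banners) is ignored
            pages.append(tuple(float(value) for value in match.groups()))
    return pages


def pdf_coverage(pdf_path, gs="gs"):
    """Run the inkcov device on one PDF and parse what it prints to stdout."""
    result = subprocess.run(
        [gs, "-dQUIET", "-sDEVICE=inkcov", "-o", "-", str(pdf_path)],
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError if Ghostscript reports failure
    )
    return parse_inkcov(result.stdout)


def folder_to_csv(folder, csv_path, gs="gs"):
    """Write one CSV row per page: file name, page number, C, M, Y, K."""
    with open(csv_path, "w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(["file", "page", "c", "m", "y", "k"])
        for pdf in sorted(Path(folder).glob("*.pdf")):
            try:
                pages = pdf_coverage(pdf, gs)
            except subprocess.CalledProcessError:
                continue  # skip PDFs that Ghostscript could not process
            for number, cmyk in enumerate(pages, start=1):
                writer.writerow([pdf.name, number, *cmyk])
```

Calling folder_to_csv("folder", "coverage.csv") would then produce the CSV in one go. The file is opened in write mode, so there is no stale output to delete between runs. If you do want a per-document figure, averaging each column over a file's rows gives the mean coverage mentioned in the comments.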
