I am trying to use the Linux command-line tool Poppler to extract information from PDF files. I want to do this for a large number of PDFs across several Spark workers. I need to use Poppler specifically, not PyPDF or anything similar.
Does anybody know how to install Poppler on the workers? I know that I can make command-line calls from within Python and capture the output (or fetch the file generated by Poppler), but how do I install it on each worker? I'm using Spark 1.3.1 (Databricks).
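For context, this is roughly the call I would run on each worker once Poppler is there. It is only a sketch: `extract_text` and its `tool_cmd` parameter are names I made up, and it assumes `pdftotext` is already on the worker's PATH, which is exactly the part I have not solved.

```python
import os
import subprocess
import tempfile


def extract_text(pdf_bytes, tool_cmd=("pdftotext",)):
    """Write the PDF bytes to a temp file, run the tool on it, return its stdout.

    The command run is: <tool_cmd...> <temp file path> -
    For pdftotext, the trailing "-" sends the extracted text to stdout.
    Assumption: the tool is installed and on the PATH of the machine running this.
    """
    fd, path = tempfile.mkstemp(suffix=".pdf")
    try:
        with os.fdopen(fd, "wb") as handle:
            handle.write(pdf_bytes)
        # Raises CalledProcessError if the tool exits with a non-zero status.
        output = subprocess.check_output(list(tool_cmd) + [path, "-"])
    finally:
        os.remove(path)
    return output.decode("utf-8", "replace")


# Hypothetical Spark usage (sc is the SparkContext, the path is a placeholder):
#   texts = sc.binaryFiles("/path/to/pdfs").mapValues(extract_text)
```

The temp file is there because I get the PDFs as bytes inside the job rather than as local paths; if the files were already on each worker's disk, the call could take the path directly.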
Thank you!