
How can I detect whether a document (say doc, xls, ppt or pdf) contains images?

I came across Apache Tika and am trying its command-line option: http://tika.apache.org/1.2/gettingstarted.html

I am using Python 2.7, but I am not quite sure how Tika will detect images.

I am new to Django. Any help is appreciated.

Thanks

user1839132
  • Decide on a definitive list of file formats to support, then tackle each one individually. As a start, Microsoft's newer formats (docx, xlsx, pptx) are all zip files, so for those you can check whether there is a non-empty image directory in the archive. The legacy doc, xls and ppt formats are not zip files and need separate handling. – kalhartt Jan 23 '13 at 04:11
  • @kalhartt: is there any way other than apache-tika (pure Python) to detect whether an image is present in a pdf? – user1839132 Jan 23 '13 at 07:04
  • [Python-tika](http://redmine.djity.net/projects/pythontika/wiki) might be of use to you, although the docs don't seem very complete. Without Tika, [PDFMiner](http://www.unixuser.org/~euske/python/pdfminer/index.html) could do the job. – kalhartt Jan 23 '13 at 12:55
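
The zip check suggested in the comments could look roughly like the sketch below. It assumes the zip-based Office formats (docx, xlsx, pptx), where embedded media normally sits under `word/media/`, `xl/media/` or `ppt/media/`. The function name `ooxml_has_images` is made up for illustration, and the media folder can also hold audio or video, so treat a positive result as "probably has images" rather than a guarantee.

```python
import zipfile

# Folders where the zip-based Office formats usually store embedded media.
# Assumption: the file follows the usual docx/xlsx/pptx layout.
MEDIA_PREFIXES = ("word/media/", "xl/media/", "ppt/media/")


def ooxml_has_images(path):
    """Return True if the zip-based Office file has entries in a media folder.

    `path` may be a filename or a binary file-like object.
    Hypothetical helper name, not part of any library.
    """
    # Legacy doc/xls/ppt files are not zip archives, so they fail this check.
    if not zipfile.is_zipfile(path):
        return False
    with zipfile.ZipFile(path) as archive:
        return any(
            name.startswith(MEDIA_PREFIXES) and not name.endswith("/")
            for name in archive.namelist()
        )
```

Calling `ooxml_has_images("report.docx")` (a hypothetical filename) returns `True` when at least one file sits in a media folder. PDFs and the legacy binary formats need a different approach, such as Tika or PDFMiner as mentioned above.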

1 Answer


This thread is old, but I am reviving it because there are now several solutions to this problem. Chris Mattmann, one of the developers of Tika, has made a Python integration for Tika that uses the JCC library's C++ bindings to access the JVM and run Tika. You can find that here.

There is also an Apache Tika integration for Plone using portal transforms, which uses the tika-jaxrs server to parse documents.

kslote1