I have a Django site that parses PDFs using tika-python and stores the parsed content in an Elasticsearch index. It works fine on my local machine. I want to run this setup in Docker. However, tika-python does not work there, because it requires Java 8 to run the Tika REST server in the background.

My Dockerfile:

FROM python:3.6.5

WORKDIR /site
COPY requirements.txt ./
RUN pip install -r requirements.txt
COPY . .
EXPOSE 8000
EXPOSE 9200
ENV PATH="/site/poppler/bin:${PATH}"
CMD ["python", "manage.py", "runserver", "0.0.0.0:8000"]

My requirements.txt file:

django==2.2
beautifulsoup4==4.6.0
json5==0.8.4
jsonschema==2.6.0
django-elasticsearch-dsl==0.5.1
tika==1.19
sklearn

Where (Dockerfile or requirements.txt) and how should I add the Java 8 runtime that Tika requires, so that it works in Docker? Online tutorials and examples cover a container with Java and Tika, which is easy to achieve. Unfortunately, I couldn't find a similar solution on Stack Overflow either.

Irfan Harun
  • Why can't you use the native Tika integration in Elasticsearch? https://www.elastic.co/guide/en/elasticsearch/plugins/master/ingest-attachment.html – Lupanoide Aug 29 '19 at 07:29
  • That is one of the doable options, @Lupanoide. I realized that every time I started the container and called my function to start the Tika server, a .jar file (~45 MB) was downloaded, and that was causing the issue. I downloaded it on my local machine and copied it into the container once it was up and running. That resolved at least the startup problems for me. – Irfan Harun Aug 29 '19 at 08:04
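
The workaround in the last comment (ship the Tika server jar with the container instead of downloading it on every start) could be sketched roughly as below. This is only a sketch under assumptions: `point_tika_at_local_jar` is a hypothetical helper, and `/site/tika/tika-server.jar` is an assumed location for the copied jar. The only tika-python feature it relies on is the `TIKA_SERVER_JAR` environment variable, which the library documents as the URL of the server jar and which can be a `file://` URL. Depending on the tika-python version, a matching `.md5` file may also need to sit next to the jar. It does not install Java; a Java runtime is still needed in the image.

```python
import os
from pathlib import Path


def point_tika_at_local_jar(jar_path, env=None):
    """Tell tika-python to fetch the server jar from the local filesystem.

    Hypothetical helper (not part of tika-python). It only sets the
    TIKA_SERVER_JAR environment variable that tika-python reads.
    """
    env = os.environ if env is None else env
    # Build an absolute file:// URL, e.g. file:///site/tika/tika-server.jar
    jar_url = Path(jar_path).resolve().as_uri()
    env["TIKA_SERVER_JAR"] = jar_url
    return jar_url
```

Call the helper before the first `import tika` (for example near the top of `manage.py`), since tika-python reads its environment variables when the module is imported. Setting the same variable with an `ENV` line in the Dockerfile would be an equivalent alternative.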