Questions tagged [apache-tika]

The Apache Tika™ toolkit detects and extracts metadata and structured text content from various documents using existing parser libraries.

Tika provides capabilities for identification of more than 1400 file types from the Internet Assigned Numbers Authority taxonomy of MIME types.

For most of the more common and popular formats, Tika then provides content extraction, metadata extraction and language identification capabilities.

While Tika is written in Java, it is widely used from other languages. The RESTful server and CLI Tool permit non-Java programs to access the Tika functionality.
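
As a rough illustration of calling the server from a non-Java program, here is a minimal Python sketch. It assumes a Tika server is already running at http://localhost:9998 (the server's default port) and uses the `/tika` endpoint for text extraction. The helper names `build_tika_request` and `extract_text` are made up for this example and are not part of any Tika client library.

```python
import urllib.request

# Assumption: a Tika server was started locally on its default port.
TIKA_URL = "http://localhost:9998"


def build_tika_request(data: bytes, endpoint: str = "/tika",
                       base_url: str = TIKA_URL) -> urllib.request.Request:
    """Build a PUT request that sends raw document bytes to a Tika server endpoint."""
    return urllib.request.Request(
        url=base_url.rstrip("/") + endpoint,
        data=data,
        method="PUT",
        headers={
            # Ask for plain text back and send the body as opaque bytes,
            # leaving type detection to the server.
            "Accept": "text/plain",
            "Content-Type": "application/octet-stream",
        },
    )


def extract_text(path: str, base_url: str = TIKA_URL) -> str:
    """Send a file to the /tika endpoint and return the extracted plain text."""
    with open(path, "rb") as fh:
        request = build_tika_request(fh.read(), "/tika", base_url)
    with urllib.request.urlopen(request) as response:
        return response.read().decode("utf-8")
```

Passing `"/detect/stream"` as the endpoint to `build_tika_request` should return the detected MIME type instead of the extracted text. `extract_text("/path/to/file")` only works while the server is reachable.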

1283 questions
6
votes
1 answer

Apache Tika OCR without Tesseract installing

I am using the Apache Tika parser to parse PDF files into text. Some PDFs may contain scanned documents. Apache Tika uses Tesseract to recognize text in images, but there is no JAR library that bundles Tesseract, so the user has to install Tesseract as…
Nox
  • 191
  • 1
  • 11
6
votes
1 answer

ImportError: cannot import name parser with tika-python

Done with:
java -jar tika-server-path --port xxxx
pip install tika (virtualenv)
parser-tika.py:
import tika
from tika import parser
parsed = parser.from_file('/path/to/file')
print parsed["metadata"]
print parsed["content"]
error: …
Aswin
  • 349
  • 4
  • 16
6
votes
2 answers

How to detect that mime type is for executable file?

I am using Apache Tika to detect the MIME type of an input stream, and I was wondering whether there is a ready-made method to detect that the file is an executable. There is a big list of MIME types for executable files…
Mahmoud Saleh
  • 33,303
  • 119
  • 337
  • 498
6
votes
0 answers

Parsing emails using Tika

I'm looking to parse an email .msg or .eml file using Tika. With the code below, I'm able to parse the email along with what is inside the attachment. However, I'd like to get the attachment text and name in a different object. Is this…
Shak Ham
  • 1,209
  • 2
  • 9
  • 27
6
votes
4 answers

parse tables from a PDF document

The PDF in this link (http://www.lenovo.com/psref/pdf/psref450.pdf) contains a number of tables like this: I'd like to programmatically extract the data and the structure from these tables. Things I've tried: converting the PDF to HTML using Tika:…
Alex Woolford
  • 4,433
  • 11
  • 47
  • 80
6
votes
2 answers

Files locked after indexing

I have the following workflow in my (web)application: download a PDF file from an archive, index the file, delete the file. My problem is that after indexing, the file remains locked and the delete step throws an exception. Here is my code snippet…
Francesco
  • 2,350
  • 11
  • 36
  • 59
6
votes
2 answers

Apache Tika and Json

When I use Apache Tika to determine the file type from the content, XML files are detected correctly but JSON is not. If the content is JSON, it returns "text/plain" instead of "application/json". Any help? public static String tiKaDetectMimeType(final File…
songjing
  • 545
  • 4
  • 22
6
votes
1 answer

Solr for Arabic PDFs

I am trying to search Arabic PDFs in Apache Solr. The problem appears to be that Tika indexes the PDF in reverse order (left-to-right instead of right-to-left). I have found references about this problem here: Solr for Arabic How to parse arabic…
perpetual_dream
  • 1,046
  • 5
  • 18
  • 51
6
votes
1 answer

Customising the search algorithm of Elasticsearch

I originally tried posting a similar post to the elasticsearch mailing list (https://groups.google.com/forum/?fromgroups=#!topic/elasticsearch/BZLFJSEpl78) but didn't get any helpful responses, so I thought I'd give Stack Overflow a try. This is my…
rstuart85
  • 2,035
  • 2
  • 15
  • 19
5
votes
1 answer

How can I detect Farsi web pages with Tika?

I need sample code to help me detect Farsi-language web pages with the Apache Tika toolkit. LanguageIdentifier identifier = new LanguageIdentifier("فارسی"); String language = identifier.getLanguage(); I have downloaded the apache.tika jar files and…
aliakbarian
  • 709
  • 1
  • 11
  • 20
5
votes
0 answers

Why is Tika's ForkParser throwing a NoClassDefFoundError when Autodetect parser seems to work fine?

I'm using Apache Tika 1.0. Whenever I parse PDF files using ForkParser, I get the following NoClassDefFoundError: java.lang.NoClassDefFoundError: org/apache/tika/fork/MemoryURLStreamHandler$Record at…
anchovie
  • 115
  • 5
5
votes
1 answer

Is there a best practice schema.xml for SOLR when importing rich documents?

I'm working with SOLR on a project where we import a bunch (~40k items) of rich documents, mainly MS Word, Powerpoint, Excel and PDFs. Is there a best practice schema.xml and/or solrconfig.xml to use in SOLR when using the ExtractingRequestHandler?…
Pål Brattberg
  • 4,568
  • 29
  • 40
5
votes
3 answers

Alternative to Tika/PDFBox for parsing PDF in Solr (any version later than 1.4)

Seems like Solr is not parsing my PDF files correctly. I was wondering if there is any other alternative to using Apache Tika (which I believe uses PDFBox internally) for parsing PDF files? I seem to be getting random spaces in between my content…
5
votes
1 answer

Spark 2.x + Tika: java.lang.NoSuchMethodError: org.apache.commons.compress.archivers.ArchiveStreamFactory.detect

I am trying to resolve a spark-submit classpath runtime issue for an Apache Tika (>v 1.14) parsing job. The problem seems to involve spark-submit classpath vs my uber-jar. Platforms: CDH 5.15 (Spark 2.3 added via CDH docs) and CDH 6 (Spark 2.2…
5
votes
0 answers

How to configure Google Vision API with Tika parser

I am trying to parse images using the Apache tika-parser in Python, but sometimes I get the content as "none". When I try the same image with the Google Vision API, it gives me a good response. Is it possible to integrate Tika with the Google Vision API?…
Manmohan
  • 373
  • 1
  • 2
  • 14