Questions tagged [apache-tika]

The Apache Tika™ toolkit detects and extracts metadata and structured text content from various documents using existing parser libraries.

Tika provides capabilities for identification of more than 1400 file types from the Internet Assigned Numbers Authority taxonomy of MIME types.

For most of the more common and popular formats, Tika then provides content extraction, metadata extraction and language identification capabilities.

While Tika is written in Java, it is widely used from other languages. The RESTful server and CLI Tool permit non-Java programs to access the Tika functionality.
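As a rough illustration of the non-Java route mentioned above, the sketch below talks to a running Tika server using only the Python standard library. It assumes the server listens on `http://localhost:9998` and accepts a document via `PUT /tika`; the helper names `build_tika_request` and `extract_text` are hypothetical and not part of any Tika client library.

```python
import urllib.request

# Assumed address of a locally running tika-server; adjust to your setup.
TIKA_SERVER = "http://localhost:9998"


def build_tika_request(data: bytes, endpoint: str = "/tika",
                       accept: str = "text/plain",
                       server: str = TIKA_SERVER) -> urllib.request.Request:
    """Build (but do not send) a PUT request carrying a document to the server."""
    return urllib.request.Request(
        url=server.rstrip("/") + endpoint,
        data=data,
        method="PUT",
        headers={"Accept": accept},
    )


def extract_text(path: str) -> str:
    """Send a file to the /tika endpoint and return the extracted plain text."""
    with open(path, "rb") as fh:
        request = build_tika_request(fh.read())
    # This call needs a reachable Tika server; it raises URLError otherwise.
    with urllib.request.urlopen(request) as response:
        return response.read().decode("utf-8")
```

With a server running, `extract_text("/path/to/file")` would return the document text. Passing `endpoint="/meta"` with `accept="application/json"` to `build_tika_request` should, under the same assumptions, request metadata instead of content.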

1283 questions


1 answer

Apache Tika OCR without Tesseract installing

I am using the Apache Tika parser to parse PDF files into text. Some PDFs may contain scanned documents. Apache Tika uses Tesseract to recognize text in images. But there is no JAR library that bundles Tesseract, and the user has to install Tesseract as…

java ocr tesseract apache-tika

asked Sep 16 '17 at 12:24

Nox


1 answer

ImportError: cannot import name parser with tika-python

Done with:
java -jar tika-server-path --port xxxx
pip install tika (virtualenv)
parser-tika.py
import tika
from tika import parser
parsed = parser.from_file('/path/to/file')
print parsed["metadata"]
print parsed["content"]
error: …

python apache-tika

asked Oct 02 '16 at 14:22

Aswin


2 answers

How to detect that mime type is for executable file?

I am using Apache Tika to detect the MIME type of an input stream, and I was wondering whether there is a ready-made method to detect that the file is executable. There's a big list of executable file MIME types…

java mime-types apache-tika

asked Feb 23 '16 at 05:37

Mahmoud Saleh


0 answers

Parsing emails using Tika

I'm looking to parse an email .msg or .eml file using Tika. With the code below, I'm able to parse the email along with what is inside the attachment. However, I'd like to get the attachment text and name in a different object. Is this…

java apache-tika

asked Nov 04 '15 at 15:47

Shak Ham


4 answers

parse tables from a PDF document

The PDF in this link (http://www.lenovo.com/psref/pdf/psref450.pdf) contains a number of tables like this: I'd like to programmatically extract the data and the structure from these tables. Things I've tried: converting the PDF to HTML using Tika:…

python parsing pdf pdfbox apache-tika

asked Mar 24 '14 at 21:40

Alex Woolford


2 answers

Files locked after indexing

I have the following workflow in my (web) application: download a PDF file from an archive, index the file, delete the file. My problem is that after indexing the file, it remains locked and the delete step throws an exception. Here is my code snippet…

solrj solr4 apache-tika

asked Feb 26 '14 at 08:33

Francesco


2 answers

Apache Tika and Json

When I use Apache Tika to determine the file type from the content, XML files are detected correctly but JSON files are not. If the content is JSON, it returns "text/plain" instead of "application/json". Any help? public static String tiKaDetectMimeType(final File…

json apache-tika

asked Oct 17 '13 at 06:31

songjing


1 answer

Solr for Arabic PDFs

I am trying to search Arabic PDFs in Apache Solr. The problem appears to be that Tika indexes the PDF text in reverse order (left-to-right) instead of right-to-left. I have found references to this problem here: Solr for Arabic How to parse arabic…

drupal solr arabic right-to-left apache-tika

asked Nov 27 '12 at 17:27

perpetual_dream


1 answer

Customising the search algorithm of Elasticsearch

I originally tried posting a similar question to the elasticsearch mailing list (https://groups.google.com/forum/?fromgroups=#!topic/elasticsearch/BZLFJSEpl78) but didn't get any helpful responses, so I thought I'd give Stack Overflow a try. This is my…

java lucene elasticsearch apache-tika

asked Oct 02 '12 at 00:25

rstuart85


1 answer

how can I detect farsi web pages by tika?

I need sample code to help me detect Farsi-language web pages with the Apache Tika toolkit. LanguageIdentifier identifier = new LanguageIdentifier("فارسی"); String language = identifier.getLanguage(); I have downloaded the Apache Tika jar files and…

java apache apache-tika language-detection farsi

asked Jan 28 '12 at 11:30

aliakbarian


0 answers

Why is Tika's ForkParser throwing a NoClassDefFoundError when Autodetect parser seems to work fine?

I'm using Apache Tika 1.0. Using ForkParser, whenever I parse PDF files, I get the following NoClassDefFoundError: java.lang.NoClassDefFoundError: org/apache/tika/fork/MemoryURLStreamHandler$Record at…

java parsing fork noclassdeffounderror apache-tika

asked Dec 08 '11 at 00:50

anchovie


1 answer

Is there a best practice schema.xml for SOLR when importing rich documents?

I'm working with SOLR on a project where we import a bunch (~40k items) of rich documents, mainly MS Word, PowerPoint, Excel and PDF files. Is there a best practice schema.xml and/or solrconfig.xml to use in SOLR when using the ExtractingRequestHandler?…

solr lucene full-text-search apache-tika solr-cell

asked Dec 05 '11 at 23:31

Pål Brattberg


3 answers

Alternative to Tika/PDFBox for parsing PDF in Solr (any version later than 1.4)

Seems like Solr is not parsing my PDF files correctly. I was wondering if there is any other alternative to using Apache Tika (which I believe uses PDFBox internally) for parsing PDF files? I seem to be getting random spaces in between my content…

solr full-text-indexing pdfbox apache-tika document-conversion

asked Nov 16 '11 at 09:14

Ravish Bhagdev


1 answer

Spark 2.x + Tika: java.lang.NoSuchMethodError: org.apache.commons.compress.archivers.ArchiveStreamFactory.detect

I am trying to resolve a spark-submit classpath runtime issue for an Apache Tika (>v 1.14) parsing job. The problem seems to involve spark-submit classpath vs my uber-jar. Platforms: CDH 5.15 (Spark 2.3 added via CDH docs) and CDH 6 (Spark 2.2…

apache-spark apache-tika cloudera-cdh

asked Sep 25 '18 at 19:51

WouldRatherBeSwimming


0 answers

How to configure google vision api with tika parser

I am trying to parse images using the Apache tika-parser in Python, but sometimes I get content as "none". But when I try the same image with the Google Vision API, it gives me a good response. Is it possible to integrate Tika with the Google Vision API?…

python-3.x apache apache-tika google-vision

asked Aug 09 '18 at 13:07

Manmohan
