Questions tagged [apache-tika]

The Apache Tika™ toolkit detects and extracts metadata and structured text content from various documents using existing parser libraries.

Tika provides capabilities for identification of more than 1400 file types from the Internet Assigned Numbers Authority taxonomy of MIME types.

For most of the more common and popular formats, Tika then provides content extraction, metadata extraction and language identification capabilities.

While Tika is written in Java, it is widely used from other languages. The RESTful server and CLI Tool permit non-Java programs to access the Tika functionality.
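
As a rough illustration of the non-Java route, here is a minimal Python sketch that talks to the RESTful server using only the standard library. It assumes a tika-server instance is already running locally on its default port 9998 and uses the server's `PUT /tika` endpoint for text extraction; the helper names `build_tika_request` and `extract_text` are hypothetical and not part of Tika.

```python
import urllib.request

# Assumption: tika-server is running locally on its default port.
TIKA_SERVER = "http://localhost:9998"


def build_tika_request(data: bytes, endpoint: str = "tika",
                       accept: str = "text/plain") -> urllib.request.Request:
    """Build (but do not send) a PUT request for a Tika server endpoint."""
    return urllib.request.Request(
        url=f"{TIKA_SERVER}/{endpoint}",
        data=data,
        method="PUT",
        headers={"Accept": accept},
    )


def extract_text(path: str) -> str:
    """Send a file to the server and return the extracted plain text."""
    with open(path, "rb") as fh:
        request = build_tika_request(fh.read())
    # This call needs a reachable server; it raises URLError otherwise.
    with urllib.request.urlopen(request) as response:
        return response.read().decode("utf-8")
```

Calling `extract_text("report.pdf")` (a placeholder file name) would return the document's text if the server is reachable. Passing `endpoint="meta"` to `build_tika_request` targets the metadata endpoint instead.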

1283 questions

1 answer

Apache Tika App configuration file

I'm using Apache Tika App on my Ubuntu 16.04 Server as a command line tool to extract the content of documents. The [Apache Tika website][1] says the following: Build artifacts The Tika build consists of a number of components and produces the …

configuration apache-tika

asked Jul 28 '18 at 15:18

user164863

1 answer

Using fallback font while parsing file content using pdfbox - can it cause mistakes?

I'm using Apache Tika 1.14, which uses pdfbox 2.0.3. I use it to extract the text content of files. In production, when processing many files, I get many log statements like these: WARN o.a.p.pdmodel.font.PDTrueTypeFont - Using fallback font…

pdfbox apache-tika

asked May 22 '17 at 16:22

user3151361

1 answer

How to use Apache Tika on .Net Core?

I need to use .Net Core and create a console app that uses .NET bindings for Apache Tika. Do you guys have any idea on how to proceed? I found a wrapper called 'TikaOnDotNet', but it seems to work only with .Net Framework, not .Net Core. Is there…

.net .net-core apache-tika

asked Feb 28 '17 at 21:42

javabeginner

2 answers

How to use Apache Tika on Android

I'm trying to use Apache Tika to parse some documents, but it is giving me so many errors and warnings. build.gradle dependencies { ... compile ('org.apache.tika:tika-parsers:1.14'){ exclude group: 'org.json', module: 'json' …

android apache-tika

asked Feb 19 '17 at 01:45

X09

1 answer

Indexing PDF with page numbers with Solr

I'm indexing PDFs with Solr using the ExtractingRequestHandler. I would like to display the page number along with hits in a document, e.g. "term foo was found in bar.pdf on pages 2, 3 and 5." Is it possible to include page numbers in the query…

pdf solr full-text-search apache-tika solr-cell

asked Nov 04 '10 at 06:05

Daniel Hepper

2 answers

How to configure Apache Tika with Apache Solr 1.4.1

I want to index a large number of PDF documents. I have found a reference showing that it could be done using Apache Tika, but unfortunately I cannot find any reference that describes how I could configure Apache Tika in Solr 1.4.1. Once configured I do…

solr solrnet apache-tika solr-cell

asked Oct 05 '10 at 13:09

Ahsan Iqbal

1 answer

CSV Detector in Apache Tika

I'm using the Java library Tika by Apache (tika-core ver. 1.10). Is there an org.apache.tika.detect.Detector for CSV files? The MIME type should be text/csv, but I cannot find anything like that. I would like to use the nice detect method

java csv apache-tika

asked Aug 21 '15 at 09:34

mat_boy

2 answers

How to check that file content is really an image

To detect the real file type based on file content (rather than extension), I use Apache Tika. I wrote the following code: InputStream theInputStream = new FileInputStream("D:\\video.mp4"); try (InputStream is = theInputStream; …

java file-type apache-tika

asked Jul 11 '15 at 20:17

gstackoverflow

2 answers

How to use Tika via PHP when both installed on one server?

I need to make an internal website which allows users to upload .doc, .pdf, .xls files and see the text in a textarea box. I have created the site in PHP to the point where a user can upload the files. I have installed Tika on my server and at the…

php apache-tika

asked Jun 04 '15 at 14:09

Edward Tanguay

2 answers

How to extract main text from HTML using Tika

I want to know how I can extract the main text and plain text from HTML using Tika. One possible solution may be to use BoilerPipeContentHandler, but do you have some sample/demo code to show it? Thanks very much in advance.

html-parsing apache-tika boilerpipe

asked May 14 '14 at 11:14

user2651995

2 answers

Indexing PDF files with Symfony using Lucene

I am a Symfony developer and my web server is Linux. I already use the sfLucene plugin. What is the simplest way of indexing PDF files for search on a Linux PHP server? XPDF, installed like this Apache Tika via the SOLR sfLucene plugin branch A…

full-text-search lucene symfony1 apache-tika

asked Feb 19 '10 at 12:43

Jon Winstanley

1 answer

Mimetype check using Tika jars

I am developing a standalone Java batch process. I am trying to determine a file attachment's MIME type using the Tika JARs. I am using the Tika 1.4 JAR files. My code looks like: Parser parser= new AutoDetectParser(); InputStream stream = new…

java apache-poi apache-tika

asked Mar 06 '14 at 13:20

user2796000

1 answer

Solr ExtractingRequestHandler extracting "rect" in links

I am using the Solr ExtractingRequestHandler to extract and index HTML content. My issue is with the extracted links section that it produces. The extracted content returned has "rect" inserted where it does not exist in the HTML source. I have…

solr apache-tika solr-cell

asked Mar 04 '14 at 17:21

jakelley

0 answers

Tika 1.1 Performance Improvement

I am using Tika 1.1 and am facing an issue: Tika takes a long time to extract content from a file. Extracting a 1 MB PDF/DOC file takes around 3 seconds. Is there any way to improve performance? Any tuning or configuration which…

java apache-tika data-extraction

asked Dec 23 '13 at 14:43

Chetan Laddha

3 answers

How to get style information of elements in PDF using Apache Tika?

I am playing around with Apache Tika to extract text from PDF files. I would like to know how to get style information such as font size, text color, and whether a specific piece of text (a few words) is in italics, bold, etc. using Apache Tika. Is it even…

pdf pdfbox apache-tika

asked Oct 07 '13 at 15:48

Shekhar
