Questions tagged [apache-tika]

The Apache Tika™ toolkit detects and extracts metadata and structured text content from various documents using existing parser libraries.

Tika provides capabilities for identification of more than 1400 file types from the Internet Assigned Numbers Authority taxonomy of MIME types.

For most of the more common and popular formats, Tika then provides content extraction, metadata extraction and language identification capabilities.

While Tika is written in Java, it is widely used from other languages. The RESTful server and CLI Tool permit non-Java programs to access the Tika functionality.
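
As a rough illustration of the non-Java route, here is a minimal Python sketch that sends a file to a Tika server and asks for plain text back. It assumes a Tika server is already running locally on its default port (9998) and uses the server's `/tika` endpoint; the helper names `build_extract_request` and `extract_text` are made up for this example and are not part of Tika.

```python
import urllib.request

# Assumption: a Tika server is running locally on its default port, 9998.
TIKA_URL = "http://localhost:9998/tika"


def build_extract_request(data: bytes, url: str = TIKA_URL) -> urllib.request.Request:
    """Build a PUT request asking the Tika server for plain-text output.

    Hypothetical helper for this sketch; it only constructs the request.
    """
    return urllib.request.Request(
        url,
        data=data,
        method="PUT",
        headers={"Accept": "text/plain"},
    )


def extract_text(path: str, url: str = TIKA_URL) -> str:
    """Send the file at `path` to the Tika server and return the extracted text."""
    with open(path, "rb") as f:
        request = build_extract_request(f.read(), url)
    # This call needs the server to be reachable; it raises URLError otherwise.
    with urllib.request.urlopen(request) as response:
        return response.read().decode("utf-8")
```

Calling `extract_text("report.pdf")` (the file name is a placeholder) would return the document text if the server is up. Keeping the request construction in its own function makes it possible to inspect the method and headers without a running server.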

1283 questions
5
votes
1 answer

Apache Tika App configuration file

I'm using Apache Tika App on my Ubuntu 16.04 Server as a command-line tool to extract the content of documents. The [Apache Tika website][1] says the following: Build artifacts The Tika build consists of a number of components and produces the …
user164863
  • 580
  • 1
  • 12
  • 29
5
votes
1 answer

Using fallback font while parsing file content using pdfbox - can it cause mistakes?

I'm using Apache Tika 1.14, which uses pdfbox 2.0.3. I use it to extract the text content of files. In production, when processing many files, I get many log statements like these: WARN o.a.p.pdmodel.font.PDTrueTypeFont - Using fallback font…
user3151361
  • 109
  • 1
  • 1
  • 6
5
votes
1 answer

How to use Apache Tika on .Net Core?

I need to use .Net Core and create a console app that uses .NET bindings for Apache Tika. Do you guys have any idea on how to proceed? I found a wrapper called 'TikaOnDotNet', but it seems to work only with .Net Framework, not .Net Core. Is there…
javabeginner
  • 91
  • 3
  • 11
5
votes
2 answers

How to use Apache Tika on Android

I'm trying to use Apache Tika to parse some documents, but it is giving me many errors and warnings. build.gradle dependencies { ... compile ('org.apache.tika:tika-parsers:1.14'){ exclude group: 'org.json', module: 'json' …
X09
  • 3,827
  • 10
  • 47
  • 92
5
votes
1 answer

Indexing PDF with page numbers with Solr

I'm indexing PDFs with Solr using the ExtractingRequestHandler. I would like to display the page number along with hits in a document, e.g. "term foo was found in bar.pdf on pages 2, 3 and 5." Is it possible to include page numbers in the query…
Daniel Hepper
  • 28,981
  • 10
  • 72
  • 75
5
votes
2 answers

How to configure Apache Tika with Apache Solr 1.4.1

I want to index a large number of PDF documents. I have found a reference showing that it could be done using Apache Tika, but unfortunately I cannot find any reference that describes how I could configure Apache Tika in Solr 1.4.1. Once configured I do…
Ahsan Iqbal
  • 1,422
  • 5
  • 20
  • 39
5
votes
1 answer

CSV Detector in Apache Tika

I'm using the Java library Tika by Apache (tika-core ver. 1.10). Is there an org.apache.tika.detect.Detector for CSV files? The MIME type should be text/csv, but I cannot find anything like that. I would like to use the nice detect method
mat_boy
  • 12,998
  • 22
  • 72
  • 116
5
votes
2 answers

How to check that file content is really an image

To detect the real file type based on file content (rather than extension), I use Apache Tika. I wrote the following code: InputStream theInputStream = new FileInputStream("D:\\video.mp4"); try (InputStream is = theInputStream; …
gstackoverflow
  • 36,709
  • 117
  • 359
  • 710
5
votes
2 answers

How to use Tika via PHP when both installed on one server?

I need to make an internal website which allows users to upload .doc, .pdf, .xls files and see the text in a textarea box. I have created the site in PHP to the point where a user can upload the files. I have installed Tika on my server and at the…
Edward Tanguay
  • 189,012
  • 314
  • 712
  • 1,047
5
votes
2 answers

How to extract main text from HTML using Tika

I just want to know how I can extract the main text and plain text from HTML using Tika. Maybe one possible solution is to use BoilerPipeContentHandler, but do you have some sample/demo code to show it? Thanks very much in advance.
user2651995
  • 63
  • 1
  • 5
5
votes
2 answers

Indexing PDF files with Symfony using Lucene

I am a Symfony developer and my web server is Linux. I already use the sfLucene plugin. What is the simplest way of indexing PDF files for search on a Linux PHP server? XPDF, installed like this Apache Tika via the SOLR sfLucene plugin branch A…
Jon Winstanley
  • 23,010
  • 22
  • 73
  • 116
5
votes
1 answer

Mimetype check using Tika jars

I am developing a standalone Java batch process. I am trying to determine a file attachment's MIME type using the Tika JARs. I am using the Tika 1.4 JAR files. My code looks like Parser parser= new AutoDetectParser(); InputStream stream = new…
user2796000
5
votes
1 answer

Solr ExtractingRequestHandler extracting "rect" in links

I am using the Solr ExtractingRequestHandler to extract and index HTML content. My issue is with the extracted links section that it produces. The extracted content returned has "rect" inserted where it does not exist in the HTML source. I have…
jakelley
  • 76
  • 5
5
votes
0 answers

Tika 1.1 Performance Improvement

I am using Tika 1.1, and it is taking a long time to extract the content from files. Extracting a 1 MB PDF/DOC file takes around 3 seconds. Is there any way to improve performance? Any tuning or configuration which…
Chetan Laddha
  • 993
  • 8
  • 22
5
votes
3 answers

How to get style information of elements in PDF using Apache Tika?

I am playing around with Apache Tika to extract text from PDF files. I would like to know how to get style information such as font size, text color, and whether a specific piece of text (a few words) is in italics, bold, etc. using Apache Tika. Is it even…
Shekhar
  • 11,438
  • 36
  • 130
  • 186