Questions tagged [apache-tika]

The Apache Tika™ toolkit detects and extracts metadata and structured text content from various documents using existing parser libraries.

Tika provides capabilities for identification of more than 1400 file types from the Internet Assigned Numbers Authority taxonomy of MIME types.

For most of the more common and popular formats, Tika then provides content extraction, metadata extraction and language identification capabilities.

While Tika is written in Java, it is widely used from other languages. The RESTful server and CLI Tool permit non-Java programs to access the Tika functionality.
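As a rough illustration of the non-Java route, the sketch below sends a document to a running Tika server over HTTP using only the Python standard library. It assumes a tika-server instance is listening on localhost port 9998 and that its `/tika` endpoint accepts a PUT with the document as the request body. The function names `build_extract_request` and `extract_text` are hypothetical helpers, not part of Tika.

```python
import urllib.request

# Assumption: a tika-server instance is running locally on port 9998.
TIKA_URL = "http://localhost:9998/tika"


def build_extract_request(data: bytes, url: str = TIKA_URL) -> urllib.request.Request:
    """Build a PUT request asking the server for the plain text of a document."""
    return urllib.request.Request(
        url,
        data=data,
        method="PUT",
        headers={"Accept": "text/plain"},
    )


def extract_text(path: str, url: str = TIKA_URL) -> str:
    """Send the file at `path` to the server and return the extracted text."""
    with open(path, "rb") as fh:
        request = build_extract_request(fh.read(), url)
    with urllib.request.urlopen(request) as response:
        return response.read().decode("utf-8")
```

Calling `extract_text("example.pdf")` would only work while the server is up. Building the request separately makes it easy to inspect what would be sent without touching the network.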

1283 questions

4 answers

Extract the text from URLs using TIKA

Is it possible to extract text from URLs with Tika? Any links would be appreciated. Or is Tika usable only for PDF, Word, and other media documents?

java apache-tika

asked Jul 11 '11 at 21:30

arsenal

1 answer

Adding language profile to Apache Tika

Could anybody who has managed to do this please explain how? :-) Do I need to get n-gram files for the language I want to add? Is it a matter of creating tika.language.override.properties, adding some other language codes, and adding lang-code.ngp …

java apache-tika language-detection

asked Jun 03 '11 at 13:16

lisak

1 answer

tika solr integration

I am trying to index using a curl-based request. The request is: curl "http://localhost:8080/solr1/update/extract?literal.id=who.pdf&uprefix=attr_&fmap.content=attr_content&commit=true" -F "myfile=@/root/apache-solr-3.1.0/docs/who.pdf" On submitting…

solr full-text-search apache-tika solr-cell

asked May 31 '11 at 11:28

naveen gupta

1 answer

Apache Tika exclude some html tags

I am testing the Apache Tika REST API via Python for parsing HTML files. Everything works except one thing. Interior of

python apache-tika

asked Feb 22 '19 at 15:00

Bociek

2 answers

How can I use the HTML parser with Apache Tika in Java to extract all HTML tags?

I downloaded the tika-core and tika-parser libraries, but I could not find example code for parsing HTML documents to a string. I have to strip all HTML tags from the source of a web page. What can I do? How do I code that using Apache Tika?

java html apache apache-tika

asked Mar 25 '11 at 07:47

lkalay

1 answer

Apache Tika and document metadata

I'm doing simple processing of a variety of documents (ODS, MS Office, PDF) using Apache Tika. I have to get at least the word count, author, title, timestamps, language, etc., which is not so easy. My strategy is to use the Template Method pattern for 6…

java apache metadata documents apache-tika

asked Feb 26 '11 at 21:47

lisak

1 answer

Java utility library for Nested ZIP file handling

I am aware that Oracle notes ZIP/GZIP file compressor/decompressor methods on their website. But I have a scenario where I need to scan and find out whether any nested ZIPs/RARs are involved. For example, the following case: -MyFiles.zip …

java recursion zip apache-tika apache-commons-compress

asked Feb 11 '16 at 10:34

ha9u63a7
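
One possible approach to the nested-archive scan described in this question is a recursive walk, sketched here with the Python standard library rather than Tika or Commons Compress. The helper name `list_nested_archives` and the `!/` path separator are illustrative choices. The sketch covers ZIP only (the standard library has no RAR reader) and reads each entry fully into memory, so it is not suited to very large archives.

```python
import io
import zipfile


def list_nested_archives(data: bytes, prefix: str = "") -> list[str]:
    """Return the paths of every ZIP found inside the ZIP held in `data`, at any depth."""
    found = []
    with zipfile.ZipFile(io.BytesIO(data)) as archive:
        for info in archive.infolist():
            if info.is_dir():
                continue
            payload = archive.read(info)
            # Sniff the content instead of trusting the file extension.
            if zipfile.is_zipfile(io.BytesIO(payload)):
                path = prefix + info.filename
                found.append(path)
                # Recurse, marking the archive boundary with "!/".
                found.extend(list_nested_archives(payload, path + "!/"))
    return found
```

Passing the bytes of an outer archive returns one path per nested ZIP, with inner archives listed after the archive that contains them.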

3 answers

How do I index documents in SOLR?

I'm running Solr 1.4 on Ubuntu 10.04 (installed via apt-get solr-tomcat) and it seems to be working fine. I'm having some difficulty finding any coherent info on how to index documents, though. I'm new to Solr, so bear with me! I have a folder…

solr full-text-search apache-tika solr-cell

asked May 10 '10 at 10:48

Shane

0 answers

Handle ligatures in Apache Tika

Tika doesn't seem to recognize ligatures (fi, ff, fl...) in PDF files and replaces them with question marks. Any ideas (not limited to Tika) on how to extract PDF text while converting ligatures to separate characters? File file = new…

java pdf character-encoding apache-tika ligature

asked Mar 12 '14 at 10:30

Spadon_
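
For the ligature conversion asked about here, one option is Unicode compatibility normalization as a post-processing step. The sketch below is Python standard library only. It assumes the extracted text still contains the ligature code points (such as U+FB01), so it cannot recover characters that have already been replaced by question marks. The helper name `expand_ligatures` is hypothetical.

```python
import unicodedata


def expand_ligatures(text: str) -> str:
    """Replace compatibility characters such as the U+FB01 'fi' ligature with plain letters."""
    # NFKC also folds other compatibility forms (for example full-width letters),
    # so it changes more than ligatures.
    return unicodedata.normalize("NFKC", text)
```

For example, `expand_ligatures("\ufb01le")` returns `"file"`.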

2 answers

Difference between Apache POI api and Apache Tika Api?

I had a requirement to extract specific columns/rows from an Excel/CSV file. Somebody suggested using Tika for this task. While going through Tika, I came across the POI API and found it friendlier to use. We may have a requirement to parse PDF files in…

java apache-poi apache-tika

asked Sep 19 '13 at 06:47

Krishna

0 answers

Regarding No Unicode mapping error while parsing pdf

I have a bunch of PDF files (from different sources) and I'd like to extract text from them (unfortunately I can't attach the files). Current parsing outcome: Tika silently returns text that is missing a lot of the needed data. Using PDFBox directly…

parsing unicode pdfbox apache-tika pdf-parsing

asked Aug 06 '20 at 04:17

exenza

1 answer

Apache Tika Server - Request Header Parameters?

The Apache Tika Server provides a REST API to extract text from a document. It is also possible to set specific request header parameters like X-Tika-PDFOcrStrategy, e.g.: $ curl -T test/Dokument01.pdf http://localhost:9998/tika --header…

apache-tika tika-server

asked May 25 '20 at 21:26

Ralph

2 answers

Apache Tika and File access instead of Java Input Stream

I want to be able to create a new Tika parser to extract metadata from a file. We're already using Tika and the metadata extraction will be done consistently. I think that I've run into this problem/enhancement request for Tika: Allow passing of…

java file inputstream apache-tika

asked May 17 '11 at 21:32

George

1 answer

Detect if file is password protected without loading it into memory?

There are some existing posts out there that talk about "how to detect if a document is password protected". This is probably the most comprehensive of these links for MS Office docs: Detecting a password-protected document (The code is written in…

java apache-tika

asked Sep 18 '19 at 17:21

Nicholas DiPiazza

1 answer

Warning message from tika python module using the unpack method

I'm currently using tika to extract the text from pdf files. I found a very fast method within the tika module. This method is called unpack. This is my code:

from tika import unpack
text = unpack.from_file('example.pdf')['content']

However, once…

python python-3.x apache-tika tika-server

asked Nov 02 '18 at 16:07

teller.py3
