Questions tagged [apache-tika]

The Apache Tika™ toolkit detects and extracts metadata and structured text content from various documents using existing parser libraries.

Tika provides capabilities for identification of more than 1400 file types from the Internet Assigned Numbers Authority taxonomy of MIME types.

For most of the more common and popular formats, Tika then provides content extraction, metadata extraction and language identification capabilities.

While Tika is written in Java, it is widely used from other languages. The RESTful server and CLI Tool permit non-Java programs to access the Tika functionality.
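
As a rough illustration of access from a non-Java program, the sketch below sends a document to a Tika server over HTTP using only the Python standard library. It assumes a server is already running at http://localhost:9998/tika (the address that appears in the curl example further down this page) and that it returns UTF-8 plain text; the helper names `build_tika_request` and `extract_text` are hypothetical, not part of Tika.

```python
import urllib.request

# Assumption: a Tika server is listening locally on port 9998.
TIKA_URL = "http://localhost:9998/tika"


def build_tika_request(data: bytes, url: str = TIKA_URL) -> urllib.request.Request:
    """Build an HTTP PUT request carrying the raw document bytes."""
    return urllib.request.Request(
        url,
        data=data,
        method="PUT",
        headers={"Accept": "text/plain"},  # ask for plain-text output
    )


def extract_text(path: str, url: str = TIKA_URL) -> str:
    """Send the file at `path` to the server and return the response body as text."""
    with open(path, "rb") as f:
        request = build_tika_request(f.read(), url)
    with urllib.request.urlopen(request) as response:
        # Assumption: the server replies with UTF-8 encoded text.
        return response.read().decode("utf-8")
```

Calling `extract_text("example.pdf")` requires a reachable server; `build_tika_request` on its own only constructs the request and can be inspected offline.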

1283 questions
7
votes
4 answers

Extract the text from URLs using TIKA

Is it possible to extract text from URLs with Tika? Any links will be appreciated. Or is Tika usable only for PDF, Word, and other media documents?
arsenal
  • 23,366
  • 85
  • 225
  • 331
7
votes
1 answer

Adding language profile to Apache Tika

Could anybody who has managed to do this please explain how? :-) Do I need to get n-gram files for the language I need to add? Is it a matter of creating tika.language.override.properties, adding some other lang codes and adding lang-code.ngp …
lisak
  • 21,611
  • 40
  • 152
  • 243
7
votes
1 answer

tika solr integration

I am trying to index using a curl-based request. The request is: curl "http://localhost:8080/solr1/update/extract?literal.id=who.pdf&uprefix=attr_&fmap.content=attr_content&commit=true" -F "myfile=@/root/apache-solr-3.1.0/docs/who.pdf" On submitting…
naveen gupta
  • 71
  • 1
  • 4
7
votes
1 answer

Apache Tika exclude some html tags

I am testing the Apache Tika REST API via Python for parsing HTML files. Everything works except one thing. Interior of
7
votes
2 answers

How can I use the HTML parser with Apache Tika in Java to extract all HTML tags?

I downloaded the tika-core and tika-parser libraries, but I could not find example code to parse HTML documents to a string. I have to get rid of all HTML tags in the source of a web page. What can I do? How do I code that using Apache Tika?
lkalay
  • 89
  • 1
  • 1
  • 10
7
votes
1 answer

Apache Tika and document metadata

I'm doing simple processing of a variety of documents (ODS, MS Office, PDF) using Apache Tika. I have to get at least: word count, author, title, timestamps, language etc., which is not so easy. My strategy is using the Template Method pattern for 6…
lisak
  • 21,611
  • 40
  • 152
  • 243
7
votes
1 answer

Java utility library for Nested ZIP file handling

I am aware that Oracle notes ZIP/GZIP file compressor/decompressor methods on their website. But I have a scenario where I need to scan and find out whether any nested ZIPs/RARs are involved. For example, the following case: -MyFiles.zip …
ha9u63a7
  • 6,233
  • 16
  • 73
  • 108
7
votes
3 answers

How do I index documents in SOLR?

I'm running Solr 1.4 on Ubuntu 10.04 (installed via apt-get solr-tomcat) and it seems to be working fine. I'm having some difficulty finding any coherent info on how to index documents though. I'm new to Solr so bear with me! I have a folder…
Shane
  • 71
  • 1
  • 1
  • 2
7
votes
0 answers

Handle ligatures in Apache Tika

Tika doesn't seem to recognize ligatures (fi, ff, fl...) in PDF files and replaces them with question marks. Any idea (not only with Tika) how to extract PDF text while converting character ligatures to separate characters? File file = new…
Spadon_
  • 495
  • 2
  • 5
  • 11
7
votes
2 answers

Difference between Apache POI api and Apache Tika Api?

I had a requirement to extract specific columns/rows from an Excel/CSV file. Somebody suggested using Tika for this task. While going through Tika, I came across the POI API and found it friendlier to use. We may have a requirement to parse PDF files in…
Krishna
  • 486
  • 8
  • 20
6
votes
0 answers

Regarding No Unicode mapping error while parsing pdf

I have a bunch of PDF files (from different sources) and I'd like to extract text from them (unfortunately I can't attach the files). Current parsing outcome: Tika silently returns text that is missing a lot of needed data. Using PDFBox directly…
exenza
  • 966
  • 10
  • 21
6
votes
1 answer

Apache Tika Server - Request Header Parameters?

The Apache Tika Server provides a REST API to extract text from a document. It is also possible to set specific request header parameters like X-Tika-PDFOcrStrategy, e.g.: $ curl -T test/Dokument01.pdf http://localhost:9998/tika --header…
Ralph
  • 4,500
  • 9
  • 48
  • 87
6
votes
2 answers

Apache Tika and File access instead of Java Input Stream

I want to be able to create a new Tika parser to extract metadata from a file. We're already using Tika and the metadata extraction will be done consistently. I think that I've run into this problem/enhancement request for Tika: Allow passing of…
George
  • 211
  • 5
  • 12
6
votes
1 answer

Detect if file is password protected without loading it into memory?

There are some existing posts out there that talk about "how to detect if a document is password protected". This is probably the most comprehensive of these links for MS Office docs: Detecting a password-protected document (The code is written in…
Nicholas DiPiazza
  • 10,029
  • 11
  • 83
  • 152
6
votes
1 answer

Warning message from tika python module using the unpack method

I'm currently using tika to extract the text from pdf files. I found a very fast method within the tika module. This method is called unpack. This is my code: from tika import unpack text = unpack.from_file('example.pdf')['content'] However, once…
teller.py3
  • 822
  • 8
  • 22