Questions tagged [apache-tika]

The Apache Tika™ toolkit detects and extracts metadata and structured text content from various documents using existing parser libraries.

Tika provides capabilities for identification of more than 1400 file types from the Internet Assigned Numbers Authority taxonomy of MIME types.

For most of the more common and popular formats, Tika then provides content extraction, metadata extraction and language identification capabilities.

While Tika is written in Java, it is widely used from other languages. The RESTful server and CLI Tool permit non-Java programs to access the Tika functionality.
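As a minimal sketch of the non-Java route described above, the snippet below sends a file to a Tika server over HTTP using only the Python standard library. It assumes a Tika server is already running locally on port 9998 and that its `/tika` endpoint returns plain text for a `PUT` request; the helper names `build_extract_request` and `extract_text` are hypothetical and exist only for this illustration.

```python
import urllib.request

# Assumption: a Tika server is running locally on port 9998.
TIKA_SERVER = "http://localhost:9998"


def build_extract_request(data: bytes, base_url: str = TIKA_SERVER) -> urllib.request.Request:
    """Build a PUT request asking the server's /tika endpoint for plain text."""
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/tika",
        data=data,
        method="PUT",
        headers={"Accept": "text/plain"},
    )


def extract_text(path: str, base_url: str = TIKA_SERVER) -> str:
    """Send a file to the Tika server and return the extracted text."""
    with open(path, "rb") as handle:
        request = build_extract_request(handle.read(), base_url)
    # This call needs a reachable Tika server; it raises URLError otherwise.
    with urllib.request.urlopen(request) as response:
        return response.read().decode("utf-8")
```

Calling `extract_text("report.pdf")` would return the extracted text only if the server is reachable; the file name here is a placeholder.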

1283 questions

0 answers

Tika 1.2 PDF parse error - org.apache.pdfbox.cos.COSString cannot be cast to org.apache.pdfbox.cos.COSDictionary

I am using Solr 4.0 and the DIH (Data Import Handler) with TikaProcessor to extract text from PDF files stored in a database. When I run indexing, it fails to parse some PDF files and I get the stack trace shown below. Since Solr 4.0 uses Tika…

solr solrj pdfbox apache-tika

asked Feb 13 '13 at 07:29

Phani Kumar

1 answer

Can Solr retain the formatting of the HTML documents which were fed to it in its results?

How do I maintain the original formatting of the HTML document in the results given by Solr? I am trying to provide search functionality in one of my company's websites, which has millions of documents, and not all of them have similar formatting,…

solr solrj apache-tika solr-cell

asked Feb 08 '13 at 10:34

Mantra

1 answer

Tika fetches the binary content stored in database but does not index it

I am trying to parse the binary content stored in a database, in the file_data column of the document_attachment table, and index it so that its content becomes available for searching using Solr. When I run the indexer it fetches the rows…

solr binaryfiles apache-tika

asked Feb 03 '13 at 10:21

Chhavi Gangwal

1 answer

NoClassDefFoundError errors in Sling logs when uploading docx, xlsx, pptx

I am getting multiple errors (see below, one per file) when uploading any Office 2007 docs (e.g. pptx, docx, xlsx) into Sling. I am using Sling 6 stable standalone. Is anyone else experiencing this? Are there any known issues with the…

java apache apache-tika sling jackrabbit

asked Jan 24 '13 at 22:36

NabilS

1 answer

TIKA parsing feedback

Does a list exist of what types of feedback TIKA can provide about files it cannot parse? I'm trying to decide whether or not to provide end user feedback or feedback for an operations team or both based on what TIKA can tell me. For example if a MS…

solr apache-tika

asked Jan 22 '13 at 11:24

user195166

1 answer

python detect image in a document

How can I detect images in a document, say doc, xls, ppt or pdf? I came across Apache Tika and am trying the command-line option (http://tika.apache.org/1.2/gettingstarted.html). I am using Python 2.7, but I am not quite sure how it will detect images. I am…

python apache-tika

asked Jan 22 '13 at 10:12

user1839132

1 answer

Apache Tika : parsing visio files (.vsd)

I'm currently writing a program in Java to extract metadata from multiple document types. At the moment I'm trying to extract metadata from .vsd files using Apache Tika. I previously tried using Apache POI directly, but the fact is it's very hard to…

java apache visio apache-tika

asked Jan 17 '13 at 22:53

Bdloul

3 answers

Configure Apache Solr 3.6 with Tika 1.2

I am using Solr 3.6 with Tika 1.2 but I can't upload PDF files. First I installed Solr and uploaded some *.xml files from the exampledocs. I could search these files with this URL: http://localhost:8983/solr/select/?q=solr. In the next step I install…

linux ubuntu solr lucene apache-tika

asked Nov 13 '12 at 20:26

henning

1 answer

Eclipse Juno EE NoClassDefFoundError when using external Jar

I added an external jar to my Eclipse dynamic web project via Folder -> properties -> build path -> Libraries -> add external jar. The code compiles fine. package servlet; import java.io.IOException; import…

apache jakarta-ee eclipse-juno apache-tika

asked Nov 04 '12 at 10:08

user962206

1 answer

Solr - Multiple attachments under one Data Import Handler record

I'm using the Data Import Handler (DIH) to create documents in Solr. Each document will have zero or more attachments. The attachments' (e.g. PDFs, Word docs, etc.) content is parsed (via Tika) and stored along with a path to the attachment. The…

solr solrj apache-tika dataimporthandler

asked Oct 24 '12 at 22:43

James

1 answer

Solr - Tika - Parsing Content to Enable Highlighting

My understanding is that indexing a PDF, Word, Excel, etc. document through Solr will allow searching but not highlighting. I have this code to perform the indexing: String urlString = "http://localhost:8983/solr"; SolrServer solr =…

solr highlighting apache-tika

asked Oct 09 '12 at 16:13

James

1 answer

Solr Tika XPath Exception

I'm trying to index an HTML document using Apache Solr and the TikaEntityProcessor, with the idea being that I can use XPath to select specific elements from the HTML. I have followed the advanced example shown at the bottom of the…

xpath solr nullpointerexception apache-tika

asked Oct 03 '12 at 15:43

Sam Delaney

1 answer

Tika exception error while indexing rich documents in Rails 3

I am implementing full-text search in rich documents using sunspot_cell, with paperclip for attachments. I have done all the required configuration and included all the *.jar files in the solr/lib dir, but it is not able to index the document.…

ruby-on-rails-3 sunspot-solr apache-tika

asked Sep 27 '12 at 11:07

Karan Nanda

0 answers

Extracting images from HTML using Tika

I have the following xhtml file, which contains about 30-40 images. The file is auto-generated and the numbers of the image will change, but the {html text} content which should really be do not change. I was hoping someone could point me in the…

java apache xhtml apache-tika

asked Sep 15 '12 at 17:08

awm

1 answer

Apache Tika alternatives for iOS

I know that Apache Tika is a text extractor. It can extract text from doc, pdf, ppt and lots of other file formats. Now I need this function on iOS, so I want to know: is there any alternative to Apache Tika for iOS? If there is no such library for…

ios apache-tika

asked Sep 05 '12 at 11:30

jjyao
