Questions tagged [apache-tika]

The Apache Tika™ toolkit detects and extracts metadata and structured text content from various documents using existing parser libraries.

Tika can identify more than 1,400 file types from the Internet Assigned Numbers Authority (IANA) taxonomy of MIME types.

For most of the more common and popular formats, Tika then provides content extraction, metadata extraction and language identification capabilities.

While Tika is written in Java, it is widely used from other languages. The RESTful server and CLI Tool permit non-Java programs to access the Tika functionality.
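As a rough illustration of the non-Java route, here is a minimal Python sketch that talks to a running Tika server over HTTP using only the standard library. It assumes the server is listening at `http://localhost:9998` and exposes the `tika` (text extraction) and `detect/stream` (type detection) endpoints; check both against the version you actually run. The helper names `build_tika_request`, `extract_text`, and `detect_type` are made up for this example and are not part of Tika.

```python
import urllib.request

# Assumption: a Tika server is running locally on this address.
TIKA_SERVER = "http://localhost:9998"


def build_tika_request(endpoint, payload, accept="text/plain"):
    """Build (but do not send) a PUT request for a Tika server endpoint."""
    return urllib.request.Request(
        url=f"{TIKA_SERVER}/{endpoint.lstrip('/')}",
        data=payload,          # raw bytes of the document
        method="PUT",
        headers={"Accept": accept},
    )


def extract_text(path):
    """Send a file to the (assumed) 'tika' endpoint and return its plain text."""
    with open(path, "rb") as fh:
        request = build_tika_request("tika", fh.read())
    with urllib.request.urlopen(request) as response:
        return response.read().decode("utf-8")


def detect_type(path):
    """Send a file to the (assumed) 'detect/stream' endpoint and return its MIME type."""
    with open(path, "rb") as fh:
        request = build_tika_request("detect/stream", fh.read())
    with urllib.request.urlopen(request) as response:
        return response.read().decode("utf-8").strip()
```

With a server running, `detect_type("report.pdf")` followed by `extract_text("report.pdf")` would be the typical usage; `report.pdf` is only a placeholder file name.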

tika-framework

1283 questions
0
votes
0 answers

Tika 1.2 PDF parse error - org.apache.pdfbox.cos.COSString cannot be cast to org.apache.pdfbox.cos.COSDictionary

I am using Solr 4.0 and DIH (Data Import Handler) with TikaProcessor to extract text from PDF files stored in a database. When I run indexing, it fails to parse some PDF files and produces the stack trace shown below. Since Solr 4.0 uses Tika…
Phani Kumar
  • 440
  • 1
  • 5
  • 16
0
votes
1 answer

Can Solr retain the formatting of the HTML documents which were fed to it in its results?

How do I maintain the original formatting of the HTML document in the results given by Solr? I am trying to provide search functionality on one of my company's websites, which has millions of documents, not all of which have similar formatting,…
Mantra
  • 316
  • 3
  • 16
0
votes
1 answer

Tika fetches the binary content stored in database but does not index it

I am trying to parse the binary content stored in the database, in table document_attachment, column file_data, and to index it so that its content becomes available for searching using Solr. When I run the indexer it fetches the rows…
Chhavi Gangwal
  • 1,166
  • 9
  • 13
0
votes
1 answer

NoClassDefFoundError errors in Sling logs when uploading docx, xlsx, pptx

I am getting multiple errors (see below, one per file) when uploading any Office 2007 docs (e.g. pptx, docx, xlsx) into Sling. I am using Sling 6 stable standalone. Is anyone else experiencing this? Are there any known issues with the…
NabilS
  • 1,421
  • 1
  • 19
  • 31
0
votes
1 answer

TIKA parsing feedback

Does a list exist of what types of feedback TIKA can provide about files it cannot parse? I'm trying to decide whether or not to provide end user feedback or feedback for an operations team or both based on what TIKA can tell me. For example if a MS…
user195166
  • 417
  • 5
  • 16
0
votes
1 answer

python detect image in a document

How can I detect images in a document, say doc, xls, ppt or pdf? I came across Apache Tika and I am trying the command line option. http://tika.apache.org/1.2/gettingstarted.html I am using Python 2.7, but I am not quite sure how it will detect images. I am…
user1839132
  • 121
  • 2
  • 10
0
votes
1 answer

Apache Tika : parsing visio files (.vsd)

I'm currently writing a program in Java to extract metadata from multiple document types. At the moment I'm trying to extract metadata from .vsd files using Apache Tika. I previously tried using Apache POI directly, but the fact is it's very hard to…
Bdloul
  • 882
  • 10
  • 27
0
votes
3 answers

Configure apache solr3.6 with tika1.2

I am using solr3.6 with tika1.2 but I can't upload pdf files. First I installed solr and uploaded some *.xml files from the exampledocs. I could search these files with this URL http://localhost:8983/solr/select/?q=solr. And in the next step I install…
0
votes
1 answer

Eclipse Juno EE NoClassDefFoundError when using external Jar

I added an external jar to my Eclipse dynamic web project via Folder -> properties -> build path -> Libraries -> add external jar. The code is working fine at compile time. package servlet; import java.io.IOException; import…
user962206
  • 15,637
  • 61
  • 177
  • 270
0
votes
1 answer

Solr - Multiple attachments under one Data Import Handler record

I'm using Data Import Handler (DIH) to create documents in solr. Each document will have zero or more attachments. The attachments' (e.g. PDFs, Word docs, etc.) content is parsed (via Tika) and stored along with a path to the attachment. The…
James
  • 2,876
  • 18
  • 72
  • 116
0
votes
1 answer

Solr - Tika - Parsing Content to Enable Highlighting

My understanding is that indexing a PDF, Word, Excel, etc. document through Solr will allow searching but not highlighting. I have this code to perform the indexing: String urlString = "http://localhost:8983/solr"; SolrServer solr =…
James
  • 2,876
  • 18
  • 72
  • 116
0
votes
1 answer

Solr Tika XPath Exception

I'm trying to index an HTML document using Apache Solr and the TikaEntityProcessor, with the idea being that I can use XPath to select specific elements from the HTML. I have followed the advanced example shown at the bottom of the…
Sam Delaney
  • 1,305
  • 11
  • 10
0
votes
1 answer

Tika exception error while indexing rich documents rails 3

Well, I am just implementing full text search in rich documents using sunspot_cell. I am using paperclip for attachments. I have done all the required configuration and included all the *.jar files in the solr/lib dir. But it's not able to index the document.…
0
votes
0 answers

Extracting images from HTML from

I have the following xhtml file, which contains about 30-40 images. The file is auto-generated and the numbers of the image will change, but the {html text} content which should really be do not change. I was hoping someone could point me in the…
awm
  • 2,723
  • 2
  • 18
  • 26
0
votes
1 answer

Apache Tika alternatives for iOS

I know that Apache Tika is a text extractor. It can extract text from doc, pdf, ppt and lots of other file formats. Now I need this function on iOS, so I want to know whether there is any alternative to Apache Tika for iOS. If there is no such library for…
jjyao
  • 315
  • 5
  • 16