Questions tagged [apache-tika]

The Apache Tika™ toolkit detects and extracts metadata and structured text content from various documents using existing parser libraries.

Tika provides capabilities for identification of more than 1400 file types from the Internet Assigned Numbers Authority taxonomy of MIME types.

For most of the more common and popular formats, Tika then provides content extraction, metadata extraction and language identification capabilities.

While Tika is written in Java, it is widely used from other languages. The RESTful server and CLI Tool permit non-Java programs to access the Tika functionality.
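As a rough illustration of the RESTful route, the sketch below sends a document to a Tika server over plain HTTP and reads back the extracted text. It assumes a Tika server is already running locally on its default port 9998 and uses the server's `/tika` resource. The class name `TikaServerClient` and its helper methods are hypothetical names chosen for this example. Java's built-in HTTP client is used here, but the same PUT request can be issued from any language.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.file.Files;
import java.nio.file.Path;

public class TikaServerClient {

    // Assumption: a Tika server is running locally on its default port.
    static final URI TIKA_ENDPOINT = URI.create("http://localhost:9998/tika");

    /** Builds a PUT request that sends the raw document bytes and asks for plain text back. */
    static HttpRequest buildExtractRequest(URI endpoint, byte[] document) {
        return HttpRequest.newBuilder(endpoint)
                .header("Accept", "text/plain")
                .PUT(HttpRequest.BodyPublishers.ofByteArray(document))
                .build();
    }

    /** Sends the file to the server and returns the response body (the extracted text). */
    static String extractText(Path file) throws Exception {
        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = buildExtractRequest(TIKA_ENDPOINT, Files.readAllBytes(file));
        HttpResponse<String> response =
                client.send(request, HttpResponse.BodyHandlers.ofString());
        if (response.statusCode() != 200) {
            throw new IllegalStateException("Tika server returned HTTP " + response.statusCode());
        }
        return response.body();
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical usage: pass the path of the document to extract as the first argument.
        System.out.println(extractText(Path.of(args[0])));
    }
}
```

Keeping the request construction in its own method makes it possible to inspect the request without a server being available.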

1283 questions

votes

1 answer

Storing PDFs in Solr

I'm trying to set things up (in my local environment) so I can store PDFs in Solr, but I cannot get it to work. Right now I'm working with the files in the example folder Solr provides. I did not modify the solrconfig.xml in solr-3.6.0/example/conf…

java solr apache-tika

asked Aug 29 '12 at 21:27

ceiroa

votes

2 answers

Using TIKA and POI in the same project without getting into version issues?

I've got a requirement to generate reports as xls-sheets, but I already have TIKA in my project. TIKA does include POI; what worries me here is that TIKA 1.2 (which I'm using currently) includes a beta build of POI 3.8. I foresee the day when I…

java classpath apache-poi dependency-management apache-tika

asked Aug 24 '12 at 11:54

Durandal

votes

1 answer

To extract remote files, use tika.open(url) or wget to download to local first?

Tika can take a URL parameter to extract remote files. We can also download the remote file first and then let Tika extract it like a local file. From the performance and correctness points of view, which way is the better choice? Thanks.

solr wget apache-tika

asked Jul 23 '12 at 10:17

internal

votes

1 answer

Solr search returns results, but some sentences before and after the matched text are required

I am using Apache Solr 3.6.0. I have indexed a file with this command: curl "http://localhost:8983/solr/update/extract?stream.file=/home/Desktop/DOCUMENTS/x.pdf&stream.contentType=application/pdf&literal.id=DOC_N&commit=true" When I search for the…

solr apache-tika

asked Jun 15 '12 at 09:48

Asif S. Abid

votes

0 answers

Trying to port Tika 1.0 to Android in Eclipse: error messages referencing pom.xml

I am trying to port the Tika 1.0 core and parsers source code to Android in Eclipse and am having problems. Here's what I did: downloaded the Tika 1.0 source; opened the core and parsers sub-projects in Eclipse using the Maven plugin; exported both into their…

android eclipse apache-tika

asked May 02 '12 at 20:09

I Z

votes

0 answers

Lucene tika indexing failure

I wrote (mostly copied from the lucene-in-action ebook) an indexing example using Tika, but it doesn't index the documents at all. There is no error on compile or run. I tried indexing a .pdf, .ppt, .doc, even a .txt document, to no avail; search returns 0…

java lucene indexing apache-tika

asked Apr 10 '12 at 17:54

MRM

-1

votes

1 answer

java.lang.UnsatisfiedLinkError: no lcms in java.library.path: [/usr/lib/jvm/java-11-openjdk/lib/server

I am using the PDF parser class from the Apache Tika parser jar, which works fine with OpenJDK 8, but the same code fails after I updated OpenJDK to 11. I have tried updating the Tika parser version to the latest, but the code is still failing with the…

java apache-tika openjdk-11 pdfparser

asked Sep 05 '22 at 16:41

DeadPool

-1

votes

1 answer

How do I clean extracted code from a pdf so I can use it later

I am trying to extract data out of invoices (pdf), write that data into a csv and extract the needed information into a GUI (for example, how many of that product were sold that week). I can't use pypdf because the "print to pdf" in Windows apparently…

python apache-tika

asked Nov 21 '19 at 17:03

Schicki

-1

votes

1 answer

What model does apache tika use internally - TensorflowRESTCaptioner

I am working on an image captioning tool and came across the Apache Tika TensorflowRESTCaptioner. I would like to know which model it uses internally and how good the results are when compared with the state of the art right now in the…

apache tensorflow deep-learning apache-tika

asked Sep 05 '19 at 08:24

user11840960

-1

votes

1 answer

Text extraction for FITS similar to NetCDF?

I'm working with NetCDF and FITS files and I have Tika working for extracting the header text in NetCDF files but I can only get basic file metadata for FITS files. Does header text extraction not work on FITS files? Followed this for…

curl netcdf apache-tika fits

asked Jun 26 '18 at 15:56

mutanthumb

-1

votes

1 answer

Downloading file from Dropbox API for use in Python Environment with Apache Tika on Heroku

I'm trying to use Dropbox as a cloud-based file receptacle for an app/script. The script, written in Python, needs to take PDFs from the Dropbox and use the tika-python wrapper to convert to string. I'm able to connect to the Dropbox API and use the…

python-3.x dropbox-api apache-tika

asked May 14 '18 at 20:33

jsxgd

-1

votes

1 answer

Parsing / Converting legacy Word documents? (msword2 / 5)

We've got some really old .doc documents. Normally we use Tika (our application normally does a text extract and then a PDF/A convert), but apparently msword2 (and msword5) are not currently supported. The only alternative I found was LibreOffice…

pdf ms-word libreoffice apache-tika

asked May 14 '18 at 07:34

Zanndorin

-1

votes

2 answers

How to update metadata of a file with Tika

I want to know if it's possible to update the metadata of an OpenOffice file in Java with the Apache Tika library. If it's not possible, is there any other library or API which lets me do it?

metadata apache-tika odt

asked Feb 19 '18 at 09:47

supp

-1

votes

1 answer

How to properly resolve transitive dependencies of Tika in a Fuse (Camel) bundle?

I'm trying to implement Tika functionality in a Fuse (6.3) project. In the latest version, 1.16, Tika offers an OSGi bundle with parsers. I can't find the proper OSGi way to include Tika in my project. Any hint on how I have to create the…

apache-camel osgi apache-tika jbossfuse transitive-dependency

asked Aug 30 '17 at 09:18

Tvori

-1

votes

2 answers

ByteArrayOutputStream performance

My requirement is to create 2 copies of the InputStream: one for Apache Tika file MIME type detection and another for the output stream. private List copyInputStream(final InputStream pInputStream, final int numberOfCopies) throws…

java performance inputstream apache-tika bytearrayoutputstream

asked Aug 08 '17 at 09:32

Prateek Agarwal
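
One way to approach the stream-copying requirement in the question above is to read the source once into memory and hand out independent `ByteArrayInputStream` views over the same byte array. The sketch below reuses the `copyInputStream` name and parameters quoted in the excerpt, but the body is an assumption, since the original code is truncated. The class name `StreamCopies` is hypothetical. It only suits inputs small enough to buffer in memory.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

public class StreamCopies {

    /** Reads the source once and returns independent streams over the buffered bytes. */
    static List<InputStream> copyInputStream(InputStream source, int numberOfCopies)
            throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        byte[] chunk = new byte[8192];
        int read;
        while ((read = source.read(chunk)) != -1) {
            buffer.write(chunk, 0, read);
        }
        // toByteArray() copies the buffered data once into a right-sized array.
        byte[] bytes = buffer.toByteArray();

        List<InputStream> copies = new ArrayList<>(numberOfCopies);
        for (int i = 0; i < numberOfCopies; i++) {
            // Each stream wraps the same array; the data is not copied again here.
            copies.add(new ByteArrayInputStream(bytes));
        }
        return copies;
    }

    public static void main(String[] args) throws IOException {
        InputStream original =
                new ByteArrayInputStream("hello tika".getBytes(StandardCharsets.UTF_8));
        List<InputStream> copies = copyInputStream(original, 2);
        // One copy could go to MIME type detection, the other to the output stream.
        for (InputStream copy : copies) {
            System.out.println(new String(copy.readAllBytes(), StandardCharsets.UTF_8));
        }
    }
}
```

Because every returned stream shares one backing array, the cost is a single read of the source plus one array copy, regardless of how many copies are requested.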
