Questions tagged [apache-tika]

The Apache Tika™ toolkit detects and extracts metadata and structured text content from various documents using existing parser libraries.

Tika provides capabilities for identification of more than 1400 file types from the Internet Assigned Numbers Authority taxonomy of MIME types.

For most of the more common and popular formats, Tika then provides content extraction, metadata extraction and language identification capabilities.

While Tika is written in Java, it is widely used from other languages. The RESTful server and CLI Tool permit non-Java programs to access the Tika functionality.
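
As a minimal sketch of the non-Java route mentioned above, the following Python snippet sends a document to a Tika server over HTTP using only the standard library. It assumes a Tika server is already running locally on its default port 9998 and uses the server's `/tika` text-extraction endpoint. The function names `build_extract_request` and `extract_text` are hypothetical helpers made up for this illustration, not part of Tika.

```python
import urllib.request

# Assumption: a Tika server is running locally on its default port 9998.
TIKA_SERVER = "http://localhost:9998"


def build_extract_request(document_bytes, server=TIKA_SERVER):
    """Build a PUT request for the server's /tika text-extraction endpoint.

    Hypothetical helper for this example; it only constructs the request
    and does not contact the server.
    """
    return urllib.request.Request(
        url=server.rstrip("/") + "/tika",
        data=document_bytes,
        method="PUT",
        headers={"Accept": "text/plain"},  # ask for plain text output
    )


def extract_text(path, server=TIKA_SERVER):
    """Send the file at `path` to the server and return the extracted text."""
    with open(path, "rb") as f:
        request = build_extract_request(f.read(), server)
    with urllib.request.urlopen(request) as response:
        return response.read().decode("utf-8")
```

With a server running, `extract_text("report.pdf")` (the file name is a placeholder) would return the document's text as a string. Splitting request construction from sending keeps the HTTP details inspectable without a live server.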

tika-framework

1283 questions
0
votes
1 answer

Storing PDFs in Solr

I'm trying to set things up (in my local environment) so I can store PDFs in Solr, but I cannot get it to work. Right now I'm working with the files in the example folder Solr provides. I did not modify the solrconfig.xml in solr-3.6.0/example/conf…
ceiroa
  • 5,833
  • 2
  • 21
  • 18
0
votes
2 answers

Using TIKA and POI in the same project without getting into version issues?

I've got a requirement to generate reports as xls-sheets, but I already have TIKA in my project. TIKA does include POI; what worries me here is that TIKA 1.2 (which I'm using currently) includes a beta build of POI 3.8. I foresee the day when I…
Durandal
  • 19,919
  • 4
  • 36
  • 70
0
votes
1 answer

To extract remote files, use tika.open(url) or wget to download to local first?

Tika can use a URL parameter to extract remote files. We can also download the remote file first, then let Tika extract it like a local file. From the standpoint of performance and correctness, which way is the better choice? Thanks.
internal
  • 1
  • 1
0
votes
1 answer

Solr search returns results, but some sentences before and after the matched text are required

I am using Apache Solr 3.6.0. I have indexed a file with this command: curl "http://localhost:8983/solr/update/extract?stream.file=/home/Desktop/DOCUMENTS/x.pdf&stream.contentType=application/pdf&literal.id=DOC_N&commit=true" When I search for the…
Asif S. Abid
  • 57
  • 1
  • 14
0
votes
0 answers

Trying to port Tika 1.0 to Android in Eclipse: error messages referencing pom.xml

I am trying to port Tika 1.0 core and parsers source code to Android in Eclipse and having problems. Here's what I did: downloaded the Tika 1.0 source; opened the core and parsers sub-projects in Eclipse using the Maven plugin; exported both into their…
I Z
  • 5,719
  • 19
  • 53
  • 100
0
votes
0 answers

Lucene tika indexing failure

I wrote (mostly copied from the lucene-in-action ebook) an indexing example using Tika, but it doesn't index the documents at all. There is no error on compile or run. I tried indexing a .pdf, .ppt, .doc, even a .txt document, to no avail; the search returns 0…
MRM
  • 561
  • 5
  • 12
  • 29
-1
votes
1 answer

java.lang.UnsatisfiedLinkError: no lcms in java.library.path: [/usr/lib/jvm/java-11-openjdk/lib/server

I am using the PDF parser class from the Apache Tika parser jar, which works fine with OpenJDK 8, but the same code fails after I updated OpenJDK to 11. I have tried updating the Tika parser version to the latest, but the code still fails with the…
DeadPool
  • 40
  • 8
-1
votes
1 answer

How do I clean extracted code from a PDF so I can use it later

I am trying to extract data out of invoices (PDF), write that data into a CSV and extract the needed information into a GUI (for example, how many of that product were sold that week). I can't use pypdf because the "print to pdf" in Windows apparently…
Schicki
  • 11
  • 3
-1
votes
1 answer

What model does Apache Tika use internally - TensorflowRESTCaptioner

I am working on an image captioning tool and came across the Apache Tika TensorflowRESTCaptioner. I would like to know which model it uses internally and how good the results are compared with the current state of the art in the…
user11840960
-1
votes
1 answer

Text extraction for FITS similar to NetCDF?

I'm working with NetCDF and FITS files and I have Tika working for extracting the header text in NetCDF files but I can only get basic file metadata for FITS files. Does header text extraction not work on FITS files? Followed this for…
mutanthumb
  • 161
  • 1
  • 13
-1
votes
1 answer

Downloading file from Dropbox API for use in Python Environment with Apache Tika on Heroku

I'm trying to use Dropbox as a cloud-based file receptacle for an app/script. The script, written in Python, needs to take PDFs from the Dropbox and use the tika-python wrapper to convert to string. I'm able to connect to the Dropbox API and use the…
jsxgd
  • 403
  • 1
  • 5
  • 16
-1
votes
1 answer

Parsing / Converting legacy Word documents? (msword2 / 5)

We have some really old .doc documents. Normally we use Tika (our application normally does a text extract and then a PDF/A conversion), but apparently msword2 (and msword5) are not currently supported. The only alternative I found was LibreOffice…
Zanndorin
  • 360
  • 3
  • 15
-1
votes
2 answers

How to update the metadata of a file with Tika

I want to know if it's possible to update the metadata of an OpenOffice file in Java with the Apache Tika library. If it's not possible, is there any other library or API which would let me do it?
supp
  • 39
  • 2
  • 6
-1
votes
1 answer

How to properly resolve transitive dependencies of Tika in a Fuse (Camel) bundle?

I'm trying to implement Tika functionality in a Fuse (6.3) project. As of the current version, 1.16, Tika offers an OSGi bundle with parsers. I can't find the proper OSGi way to include Tika in my project. Any hint on how I should create the…
-1
votes
2 answers

ByteArrayOutputStream performance

My requirement is to create 2 copies of the InputStream, one for Apache Tika MIME type detection and another for the output stream. private List copyInputStream(final InputStream pInputStream, final int numberOfCopies) throws…