Questions tagged [apache-tika]

The Apache Tika™ toolkit detects and extracts metadata and structured text content from various documents using existing parser libraries.

Tika provides capabilities for identification of more than 1400 file types from the Internet Assigned Numbers Authority taxonomy of MIME types.

For most of the more common and popular formats, Tika then provides content extraction, metadata extraction and language identification capabilities.

While Tika is written in Java, it is widely used from other languages. The RESTful server and CLI Tool permit non-Java programs to access the Tika functionality.
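
As a rough illustration of the RESTful route mentioned above, here is a minimal Python sketch that sends a file to a running tika-server and asks for plain text back. It assumes a server is listening locally on the default port 9998 and uses the server's `PUT /tika` resource. The helper names `build_tika_request` and `extract_text` are made up for this example and are not part of any Tika client library.

```python
import urllib.request

# Assumption: a tika-server instance is running locally on its default port.
TIKA_URL = "http://localhost:9998"


def build_tika_request(data: bytes, endpoint: str = "/tika",
                       accept: str = "text/plain") -> urllib.request.Request:
    """Build (but do not send) a PUT request carrying the document bytes.

    Hypothetical helper for illustration only.
    """
    return urllib.request.Request(
        TIKA_URL + endpoint,
        data=data,
        method="PUT",
        headers={"Accept": accept},  # ask the server for a plain-text rendition
    )


def extract_text(path: str) -> str:
    """Send the file at `path` to tika-server and return the extracted text."""
    with open(path, "rb") as f:
        request = build_tika_request(f.read())
    with urllib.request.urlopen(request) as response:
        return response.read().decode("utf-8")
```

Calling `extract_text("Example.pdf")` would return the document text, provided the server is reachable. Pointing `build_tika_request` at the `/meta` endpoint with `accept="application/json"` should return metadata instead of body text.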

1283 questions
4
votes
0 answers

Tika python does not preserve the order of texts in pdf

I am using tika-python to extract text from a PDF. But when there are multiple tables on a PDF page, the order of the text is not preserved. In my case the table at the top of the page comes at the end when extracted through Tika. I tried using…
ggaurav
  • 1,764
  • 1
  • 10
  • 10
4
votes
3 answers

AttributeError: 'bytes' object has no attribute 'close' when Tika parser is run

I'm trying to run a simple line of code using Tika to parse text from a PDF (named outputFileName in this example). This used to run without errors. I recently had my laptop sent in to our work IT for software updates and had to reinstall…
dweir247
  • 63
  • 4
4
votes
1 answer

Is there a way to turn off parsing of embedded docs in the tika-server?

I run an unmodified JAX-RS instance of the Apache tika-server 1.22 and use it as an HTTP end-point service that I post files to (mostly Office, PDF and RTF) and get plain-text renditions back with HTTP requests (using the Accept="text/plain" header)…
4
votes
1 answer

Java/Spring: How to Figure out MimeType on an InputStream Without Consuming It

BASICS: This is a Java 1.8, Spring Boot 1.5 application. It currently uses Apache Tika 1.22 to read MIME type information, but this can easily be changed. SUMMARY: There is a mapper that the user uses to download files. These files come from another URL…
Miss Kitty
  • 162
  • 1
  • 3
  • 16
4
votes
1 answer

Apache Tika with Encrypted PDF

I wanted to extract PDF content using the Apache Tika library. All was fine until I encountered a PDF encrypted with a username and password. It fails with the errors below: INFO Document is encrypted org.apache.tika.exception.EncryptedDocumentException:…
fattysxx
  • 41
  • 2
4
votes
1 answer

Python Tika cannot parse pdf from url

python for parsing the online PDF for future use. My code is below. from tika import parser import requests import io url = 'https://www.whitehouse.gov/wp-content/uploads/2017/12/NSS-Final-12-18-2017-0905.pdf' response = requests.get(url) with…
Platalea Minor
  • 877
  • 2
  • 9
  • 22
4
votes
1 answer

Unable to set character encoding in java.util.Scanner

I use Apache Tika to detect the encoding of a file. FileInputStream fis = new FileInputStream(my_file); final AutoDetectReader detector = new AutoDetectReader(fis); fis.close(); System.out.println("Encoding:" +…
plaidshirt
  • 5,189
  • 19
  • 91
  • 181
4
votes
2 answers

Python - Apache Tika Single Page parser

I was wondering if there is any way using Tika/Python to parse only the first page, or to extract the metadata from the first page only? Right now, when I pass the PDF, it parses every single page. I looked at this link: Is it possible to extract…
sharp
  • 2,140
  • 9
  • 43
  • 80
4
votes
1 answer

"zip bomb" exception while sending HTML document to Solr

I'm sending an HTML document to Solr and Tika is throwing the "Zip bomb detected!" exception back. The Solr log reports: "Suspected zip bomb: 100 levels of XML element nesting". Looking at the Tika source, there is an arbitrary limit of 100 levels of XML…
Harinder
  • 333
  • 2
  • 3
  • 23
4
votes
1 answer

How to boost a SOLR document when indexing with /solr/update

To index my website, I have a Ruby script that in turn generates a shell script that uploads every file in my document root to Solr. The shell script has many lines that look like this: curl -s \ …
Dan Tenenbaum
  • 1,809
  • 3
  • 23
  • 35
4
votes
2 answers

Apache Tika - detect JSON / PDF specific mime type

I'm using Apache Tika to detect a file's MIME type from its base64 representation. Unfortunately I don't have other info about the file (e.g. extension). Is there something I can do to make Tika be more specific? I'm currently using this: Tika tika =…
Briston12
  • 357
  • 3
  • 15
4
votes
2 answers

Apache tika detects mime-type incorrectly for csv

I've created a .csv file using Excel and I wrote the following code using Apache Tika: public static boolean checkThatMimeTypeIsCsv(InputStream inputStream) throws IOException { BufferedInputStream bis = new BufferedInputStream(inputStream); …
gstackoverflow
  • 36,709
  • 117
  • 359
  • 710
4
votes
2 answers

Apache Tika vs. Apache Lucene

I have a question about analyzing documents. With Apache Tika, it is possible to get the content and metadata of files of different types. Is it also possible to get keywords of files (i.e. stemming) with Tika or do I still need…
quma
  • 5,233
  • 26
  • 80
  • 146
4
votes
1 answer

Extract text from a pdf file using Apache Tika in java

try { File file = new File("Example.pdf"); String content = new Tika().parseToString(file); System.out.println("The Content: " + content); } catch (Exception e) { e.printStackTrace(); } I have imported java.io.File…
Abhi Thakkar
  • 151
  • 4
  • 17
4
votes
1 answer

Apache Tika: how to extract HTML body without header and footer content

I am looking to extract the entire body content of an HTML document except the header and footer, but I am getting the exception org.xml.sax.SAXException: Namespace http://www.w3.org/1999/xhtml not declared. Below is the code that I have created as mentioned at…
Trinadh Gupta
  • 306
  • 5
  • 18