Questions tagged [apache-tika]

The Apache Tika™ toolkit detects and extracts metadata and structured text content from various documents using existing parser libraries.

Tika provides capabilities for identification of more than 1400 file types from the Internet Assigned Numbers Authority taxonomy of MIME types.

For most of the more common and popular formats, Tika then provides content extraction, metadata extraction and language identification capabilities.

While Tika is written in Java, it is widely used from other languages. The RESTful server and CLI Tool permit non-Java programs to access the Tika functionality.
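
As a rough illustration of the server route, the sketch below builds an HTTP request for a tika-server instance from Python using only the standard library. It assumes a server listening at http://localhost:9998, a `/tika` resource that accepts a PUT of the raw file bytes, and an `Accept: text/plain` header to ask for plain text. The helper names `build_tika_request` and `extract_text` are hypothetical and not part of any Tika client.

```python
import urllib.request

# Assumption: a tika-server instance is listening here (9998 is its usual default port).
TIKA_SERVER = "http://localhost:9998"


def build_tika_request(data: bytes, server: str = TIKA_SERVER) -> urllib.request.Request:
    """Build (but do not send) a PUT request asking for a plain-text rendition of `data`."""
    return urllib.request.Request(
        url=server.rstrip("/") + "/tika",
        data=data,
        method="PUT",
        headers={"Accept": "text/plain"},
    )


def extract_text(path: str, server: str = TIKA_SERVER) -> str:
    """Send the file at `path` to the server and return the extracted text."""
    with open(path, "rb") as f:
        request = build_tika_request(f.read(), server)
    with urllib.request.urlopen(request) as response:
        # Assumption: the server answers with UTF-8 encoded text.
        return response.read().decode("utf-8")
```

With a server running, `extract_text("Example.pdf")` would return the document's text. Keeping request construction separate from sending makes the HTTP details easy to inspect without a live server.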

1283 questions


0 answers

tika-python does not preserve the order of text in a PDF

I am using tika-python to extract text from a PDF. But when there are multiple tables on a PDF page, the order of the text is not preserved. In my case, the table at the top of the page comes at the end when extracted through Tika. I tried using…

python apache-tika tika-server

asked May 14 '20 at 11:12

ggaurav


3 answers

AttributeError: 'bytes' object has no attribute 'close' when Tika parser is run

I'm trying to run a simple line of code that uses Tika to parse text from a PDF (named outputFileName in this example). This used to run without errors. I recently had my laptop sent in to our work IT for software updates and had to reinstall…

python parsing apache-tika pdf-parsing tika-server

asked Nov 11 '19 at 14:46

dweir247


1 answer

Is there a way to turn off parsing of embedded docs in the tika-server?

I run an unmodified JAX-RS instance of the Apache tika-server 1.22 and use it as an HTTP end-point service that I post files to (mostly Office, PDF and RTF) and get plain-text renditions back with HTTP requests (using the Accept="text/plain" header)…

apache-tika tika-server

asked Oct 10 '19 at 08:29

henrythewasp


1 answer

Java/Spring: How to Figure out MimeType on an InputStream Without Consuming It

BASICS This is a Java 1.8 Spring Boot 1.5 Application. It currently uses Apache Tika 1.22 to read Mime-Type information, but this can easily be changed. SUMMARY There is a mapper which User uses to download files. These files come from another URL…

java spring apache-tika

asked Sep 22 '19 at 01:26

Miss Kitty


1 answer

Apache Tika with Encrypted PDF

I wanted to extract PDF content using the Apache Tika library. All was fine until I encountered a PDF encrypted with a username and password. It fails with the errors below: INFO Document is encrypted org.apache.tika.exception.EncryptedDocumentException:…

adobe apache-tika

asked Aug 27 '19 at 03:08

fattysxx


1 answer

Python Tika cannot parse pdf from url

python for parsing the online pdf for future usage. My code is below. from tika import parser import requests import io url = 'https://www.whitehouse.gov/wp-content/uploads/2017/12/NSS-Final-12-18-2017-0905.pdf' response = requests.get(url) with…

python apache-tika tika-server

asked Nov 25 '18 at 16:28

Platalea Minor


1 answer

Unable to set character encoding in java.util.Scanner

I use Apache Tika to get the encoding of a file. FileInputStream fis = new FileInputStream(my_file); final AutoDetectReader detector = new AutoDetectReader(fis); fis.close(); System.out.println("Encoding:" +…

java java.util.scanner apache-tika

asked Nov 06 '18 at 15:27

plaidshirt


2 answers

Python - Apache Tika Single Page parser

I was wondering if there is any way, using Tika/Python, to parse only the first page or extract the metadata from the first page only? Right now, when I pass the PDF, it parses every single page. I looked at this link: Is it possible to extract…

python apache-tika tika-server

asked Nov 01 '18 at 00:05

sharp


1 answer

"zip bomb" exception while sending an HTML document to Solr

I'm sending an HTML document to Solr and Tika is throwing the "Zip bomb detected!" exception back. The Solr log reports: "Suspected zip bomb: 100 levels of XML element nesting". Looking at the Tika source, there is an arbitrary limit of 100 levels of XML…

solr apache-tika

asked Apr 06 '18 at 18:46

Harinder


1 answer

How to boost a SOLR document when indexing with /solr/update

To index my website, I have a Ruby script that in turn generates a shell script that uploads every file in my document root to Solr. The shell script has many lines that look like this: curl -s \ …

solr apache-tika solr-cell

asked Feb 09 '11 at 02:24

Dan Tenenbaum


2 answers

Apache Tika - detect JSON / PDF specific mime type

I'm using Apache Tika to detect a file's MIME type from its base64 representation. Unfortunately, I don't have any other information about the file (e.g. its extension). Is there something I can do to make Tika more specific? I'm currently using this: Tika tika =…

java mime-types apache-tika

asked Feb 05 '18 at 08:45

Briston12


2 answers

Apache Tika detects MIME type incorrectly for CSV

I've created a .csv file using Excel and I wrote the following code using Apache Tika: public static boolean checkThatMimeTypeIsCsv(InputStream inputStream) throws IOException { BufferedInputStream bis = new BufferedInputStream(inputStream); …

java csv apache-tika file-type probe

asked Oct 26 '17 at 17:18

gstackoverflow


2 answers

Apache Tika vs. Apache Lucene

I have a question about analyzing documents. With Apache Tika, it is possible to get the content and metadata of files of different types. Is it also possible to get keywords from files (i.e. stemming) with Tika, or do I still need…

lucene apache-tika

asked Oct 10 '17 at 09:26

quma


1 answer

Extract text from a pdf file using Apache Tika in java

try { File file = new File("Example.pdf"); String content = new Tika().parseToString(file); System.out.println("The Content: " + content); } catch (Exception e) { e.printStackTrace(); } I have imported java.io.File…

java apache apache-tika

asked Jul 31 '17 at 11:04

Abhi Thakkar


1 answer

Apache Tika: how to extract the HTML body without header and footer content

I am looking to extract the entire body content of an HTML page except the header and footer; however, I am getting the exception org.xml.sax.SAXException: Namespace http://www.w3.org/1999/xhtml not declared. Below is the code that I have created, as mentioned at…

html parsing apache-tika boilerpipe

asked Mar 03 '17 at 21:53

Trinadh Gupta
