Questions tagged [apache-tika]

The Apache Tika™ toolkit detects and extracts metadata and structured text content from various documents using existing parser libraries.

Tika provides capabilities for identification of more than 1400 file types from the Internet Assigned Numbers Authority taxonomy of MIME types.

For most of the more common and popular formats, Tika then provides content extraction, metadata extraction and language identification capabilities.

While Tika is written in Java, it is widely used from other languages. The RESTful server and CLI Tool permit non-Java programs to access the Tika functionality.
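
As a rough illustration of the server route mentioned above: tika-server accepts a document via HTTP `PUT` on its `/tika` endpoint and returns the extracted text when plain text is requested. The sketch below only builds such a request with the JDK's own HTTP client (Java 11+) and does not send it. The class name `TikaServerRequest`, the method name `buildExtractTextRequest`, and the base URL passed in by the caller are assumptions made for this example, not part of Tika.

```java
import java.net.URI;
import java.net.http.HttpRequest;

// Hypothetical helper (not part of Tika): prepares a text-extraction
// request for a tika-server instance that is assumed to be running.
final class TikaServerRequest {

    // Builds a PUT request to <baseUrl>/tika carrying the raw document bytes.
    // The Accept header asks the server for plain-text output.
    static HttpRequest buildExtractTextRequest(String baseUrl, byte[] document) {
        return HttpRequest.newBuilder()
                .uri(URI.create(baseUrl + "/tika"))
                .header("Accept", "text/plain")
                .PUT(HttpRequest.BodyPublishers.ofByteArray(document))
                .build();
    }
}
```

To actually run the extraction, the request could be passed to `HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString())` against a running server, for example one listening at `http://localhost:9998`. The same request can be issued from any language's HTTP client.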

1283 questions
4
votes
0 answers

How to limit amount of extracted text with Tika server?

In my scenario, I have some large PDF files and would like to limit the amount of text extracted and returned by Tika server. I know it's possible using the Java library directly. However, how can I do this when making HTTP requests to tika-server /tika…
Eugene Shvets
  • 4,561
  • 13
  • 19
4
votes
2 answers

Tika detects docx file as Zip

I have the following test code to detect docx content type: @Test public void testContentTypeOfaWordDOCXFileIsReturnedCorrectlyByTheServer() throws IOException, TikaException { File docxFile = new File(FILE_COMPLETE_PATH); …
qartal
  • 2,024
  • 19
  • 31
4
votes
0 answers

Apache Tika - Parsing and extracting only metadata without reading content

Is there a way to configure Apache Tika so that it extracts only the metadata properties from a file and does not access the file's content? We need this to avoid reading the entire content of larger files. The code…
Venki
  • 2,129
  • 6
  • 32
  • 54
4
votes
1 answer

How to access all the PDF metadata using pdfbox

I have some simple Java code that uses the Tika library to get the metadata of a PDF file, and it lists the metadata below. Tika code: Metadata metadata = new Metadata(); tika.parse(file, metadata); String[] metadataNames = metadata.names(); for (String…
Learner
  • 2,303
  • 9
  • 46
  • 81
4
votes
0 answers

Unexpected RuntimeException from tika

I'm trying to extract the content of a large dataset that contains a mix of files (pdf, doc, ppt). I'm using tika-app-1.12.jar. When I run my code, everything works perfectly at first, then I get this error: Exception in thread "main"…
Abeer zaroor
  • 320
  • 2
  • 17
4
votes
0 answers

How to configure Tesseract language for TikaEntityProcessor in Solr

I have a Solr core, and I use TikaEntityProcessor in my DataImportHandler. I have Tesseract installed and Tika can extract text from images, but the default language is English. Here is the Tika part of my data-import-handler.xml file
Veysel Ozdemir
  • 675
  • 7
  • 12
4
votes
1 answer

How to extract hyperlinks from office documents using tika

I'm using Apache Tika to extract raw text from various document formats, including Office formats. When extracting text from Word documents that include hyperlinks, only the text is extracted and the information about the hyperlink is lost. Is there a…
Matthias
  • 178
  • 2
  • 6
4
votes
0 answers

Apache Tika: How to use XPath queries

I am parsing an XML file using Apache Tika. I would like to extract certain tags with their content from the XML and store them in a HashMap. Right now, I can extract the entire content of the XML, but the tags are lost. //detecting the file type …
AbtPst
  • 7,778
  • 17
  • 91
  • 172
4
votes
0 answers

Lucene 4 - How to discard numeric terms in index?

I'm using Apache Tika to parse an XML document before indexing with Apache Lucene. This is the Tika part: BodyContentHandler handler = new BodyContentHandler(10*1024*1024); Metadata metadata = new Metadata(); FileInputStream inputstream = new…
tommy
  • 139
  • 9
4
votes
2 answers

Extract Images from PDF with Apache Tika

Apache Tika 1.6 has the ability to extract inline images from PDF documents. However, I've been struggling to get it to work. My use case is that I want some code that will extract the content and separately the images from any documents (not…
James Baker
  • 1,143
  • 17
  • 39
4
votes
4 answers

How to parse large text file with Apache Tika 1.5?

Problem: For my test, I want to extract text data from a 335 MB text file, Wikipedia's "pagecounts-20140701-060000.txt", with Apache Tika. My solution: I tried to use TikaInputStream since it provides buffering, then I tried to use…
minerals
  • 6,090
  • 17
  • 62
  • 107
4
votes
1 answer

Get embedded resources in doc files using Apache Tika

I have MS Word documents containing text and images. I want to parse them into an XML structure. After some research, I ended up using Apache Tika to convert my documents. I can parse my doc to XML. Here is my code: AutoDetectParser…
Mohamad Ghafourian
  • 1,052
  • 1
  • 14
  • 26
4
votes
0 answers

How to parse RTF document using Apache Tika in Java

I am parsing a document that contains RTF content using Apache Tika, but it throws an exception and does not return the contents of the document. Here is a piece of code: public String contentEx(File f) throws IOException, SAXException, …
Rahul Kulhari
  • 1,115
  • 1
  • 15
  • 44
4
votes
2 answers

How can I specify encoding when parsing text with Apache Tika?

The question is pretty self-explanatory. The problem I am facing is that any Tika example code I found online uses a StringWriter, as shown below. If I could somehow make this use an OutputStreamWriter, I could specify the encoding with no problem...…
superdemongob
  • 248
  • 5
  • 13
4
votes
1 answer

HTML Formatted Cell value from Excel using Apache POI

I am using Apache POI to read an Excel document. To say the least, it serves my purpose for now. But one thing I am stuck on is extracting the value of a cell as HTML. I have one cell wherein the user will enter some string and…
AngelsandDemons
  • 2,823
  • 13
  • 47
  • 70