Questions tagged [apache-tika]

The Apache Tika™ toolkit detects and extracts metadata and structured text content from various documents using existing parser libraries.

Tika can identify more than 1,400 file types from the Internet Assigned Numbers Authority (IANA) taxonomy of MIME types.

For most of the more common and popular formats, Tika then provides content extraction, metadata extraction and language identification capabilities.

While Tika is written in Java, it is widely used from other languages. The RESTful server and CLI Tool permit non-Java programs to access the Tika functionality.
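As a rough illustration of the RESTful route, the sketch below builds an HTTP request that uploads a document to a Tika server and asks for plain text back. It assumes a server is already running on `http://localhost:9998` and that it exposes a `/tika` resource accepting `PUT`; check both against your own deployment. The class and method names are hypothetical, and the same call could be issued from any language with an HTTP client.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical class name; this is not part of Tika itself.
public class TikaServerSketch {

    // Assumption: a Tika server is listening locally on port 9998.
    static final String ASSUMED_SERVER = "http://localhost:9998";

    // Builds a PUT request that uploads the document bytes to the /tika
    // resource and asks for the extracted content as plain text.
    static HttpRequest buildExtractRequest(String serverBase, byte[] document) {
        return HttpRequest.newBuilder(URI.create(serverBase + "/tika"))
                .header("Accept", "text/plain")
                .PUT(HttpRequest.BodyPublishers.ofByteArray(document))
                .build();
    }

    public static void main(String[] args) throws Exception {
        if (args.length == 0) {
            System.out.println("usage: java TikaServerSketch <file>");
            return;
        }
        byte[] document = Files.readAllBytes(Path.of(args[0]));
        // Sends the request; this fails if no server is reachable.
        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(buildExtractRequest(ASSUMED_SERVER, document),
                      HttpResponse.BodyHandlers.ofString());
        System.out.println(response.body());
    }
}
```

Building the request separately from sending it keeps the HTTP details easy to inspect without a live server.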

1283 questions

1 answer

Read the metadata of a 6 MB XLSM file (Apache POI 3.9)

I need to read XLSM file metadata. For files smaller than 4 MB, the following instructions work correctly: try { OPCPackage pkg = OPCPackage.open (new FileInputStream ("C:\\Path to file.xlsm")); XSSFWorkbook XSSFWorkbook = new document…

apache-poi apache-tika

asked Aug 26 '13 at 19:47

user2671914

1 answer

Set the "Last Modified By" field in Office DOCX file metadata (Apache POI 3.9)

With POIXMLProperties.getCoreProperties() and POIXMLProperties.getExtendedProperties() I can set all the metadata values except "Last Modified By". Is there any way to set it? Thanks in advance.

apache-poi apache-tika

asked Aug 26 '13 at 19:02

user2671914

0 answers

How to read raw text from a PDF file using Java

I am using the PDFBox parser to read data from a PDF file in Java. It reads all the content from the PDF file. Below is my sample code to read data from a PDF file and store it in a text file. Sample code: public class PDFTextParser { PDFParser…

java pdfbox apache-tika

asked Aug 16 '13 at 10:08

user2664353

1 answer

Nutch ERROR tika.TikaParser on Eclipse

I am running Nutch 2.2.1 on Eclipse Juno SR1 and JRE 1.7.0_25. The PARSE step is failing with this error: 2013-08-15 19:35:26,555 ERROR tika.TikaParser - Can't retrieve Tika parser for mime-type application/pdf 2013-08-15 19:35:26,557 WARN …

eclipse nutch apache-tika

asked Aug 16 '13 at 00:57

Osy

1 answer

Solr Tika not storing any data

I am faced with a peculiar problem. I configured my data config and schema as per the Solr wiki here: Tika DIH. The data config looks like:

apache solr apache-tika

asked Aug 16 '13 at 00:48

Varun Jain

1 answer

Apache Tika MP4 metadata keys

I'm using Apache Tika v1.4 to parse video files in the following way: Metadata metadata = new Metadata(); String content = new Tika().parseToString(file.getInputStream(), metadata); metadata.get(KEY); The problem is that I don't know which keys to…

apache apache-tika mp4parser

asked Aug 12 '13 at 06:07

Asher Gruber

2 answers

Best Tika integration on Solr or Nutch

Which is the best integration for Apache Tika, assuming that I have already connected and used Nutch (2.2.1) + Solr (4.3)? I understand that Tika can be integrated within Nutch and/or Solr, but which one is the better choice?

solr nutch apache-tika

asked Aug 08 '13 at 17:46

Osy

1 answer

What is the right way to add an apache-tika dependency to a Grails project

When using tika-1.4 I get this: Caused by: java.lang.NoClassDefFoundError: net/sf/cglib/core/DebuggingClassWriter at net.sf.cglib.core.DefaultGeneratorStrategy.getClassWriter(DefaultGeneratorStrategy.java:30) at…

grails apache-tika

asked Aug 07 '13 at 08:16

Archer

1 answer

After indexing a file, how to extract file properties such as file type, name, etc. with Elasticsearch

I have indexed the document and I am able to search its content. But I want to find the properties of the indexed file: the document type, author, name, and size. How can this be achieved with the help…

elasticsearch apache-tika

asked Aug 05 '13 at 11:00

Lav

1 answer

How to read files with special characters in Python

I have crawled PDF, HTML, and DOC files using Apache Tika and stored the structured text in text files. These text files contain some unusual special characters, and because of these special characters I am unable to read the text files. I have the code below…

python file apache-tika

asked Aug 02 '13 at 10:19

user2609542

0 answers

Can Apache Tika Extract Attachments?

I am using Apache Tika to extract text from various document formats. I would like to extract images from those files as well (usually PDF or Word). I am using TikaCLI as a proof of concept with the -z (--extract) option, but it never extracts any…

java apache-tika

asked Jul 23 '13 at 11:33

jriffel73

2 answers

Remove all special characters from file line except white space

I have extracted text from some PDF files using Tika and stored the text in text files. Now I want to parse these files using the OpenNLP chunk parser, but I was unable to parse the file lines because they contain some special characters (some square…

java file apache-tika opennlp

asked Jul 23 '13 at 07:29

user2609542
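One possible approach to the question above, as a minimal sketch: drop every character that is not a Unicode letter, a digit, or whitespace. Which characters count as "special" is an assumption here, since the excerpt is truncated; the class and method names are hypothetical, and the character class would need adjusting if punctuation should survive.

```java
// Hypothetical helper; the definition of "special character" is an assumption.
public class StripSpecialChars {

    // Removes every character that is not a Unicode letter (\p{L}),
    // a Unicode digit (\p{N}), or whitespace (\s).
    static String keepLettersDigitsWhitespace(String line) {
        return line.replaceAll("[^\\p{L}\\p{N}\\s]", "");
    }

    public static void main(String[] args) {
        // \u25A1 is a white square, similar to the boxes that often
        // show up in text extracted from PDF files.
        String line = "Hello, wor\u25A1ld! 42";
        System.out.println(keepLettersDigitsWhitespace(line));
    }
}
```

Whitespace is preserved as-is, so tokens stay separated for a downstream parser.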

0 answers

How to extract keywords or tags from the file content

I have some files in different formats (HTML, PDF, DOC, EPUB). Using Apache Tika and Java I have extracted metadata and stored it in MongoDB. Now my aim is to extract keywords or tags from the file content and add them to one of the metadata fields. Is…

java tags metadata keyword apache-tika

asked Jul 02 '13 at 05:12

user2522836

3 answers

Using Java Libraries With Incompatible Dependencies

I'm working on a project where I'd like to use Apache Tika and Apache Jena. However, when I try to run the project I get the following exception: java.lang.NoSuchMethodError: org.slf4j.spi.LocationAwareLogger.log…

java slf4j jena apache-tika

asked Jul 01 '13 at 13:58

James Baker

1 answer

Write output values to a JSON file using Java

Hi, below is my code to extract particular metadata tags and write those tags to a JSON file. I imported json.lib.jar and tika-app.jar into my build path. File dir = new File("C:/pdffiles"); File listDir[] = dir.listFiles(); for (int i = 0; i <…

json file metadata apache-tika

asked Jun 24 '13 at 12:01

user2353439
