Questions tagged [apache-tika]

The Apache Tika™ toolkit detects and extracts metadata and structured text content from various documents using existing parser libraries.

Tika provides capabilities for identification of more than 1400 file types from the Internet Assigned Numbers Authority taxonomy of MIME types.

For most of the more common and popular formats, Tika then provides content extraction, metadata extraction and language identification capabilities.

While Tika is written in Java, it is widely used from other languages. The RESTful server and CLI Tool permit non-Java programs to access the Tika functionality.
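
As a rough illustration of the RESTful route, the sketch below sends a document to a Tika server with an HTTP PUT and asks for plain text back. It assumes a Tika server is already running locally on its default port 9998 and exposing the `/tika` resource; the class name `TikaServerClient` and its helper methods are hypothetical names chosen for this example, and only the JDK's own HTTP client (Java 11+) is used.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper class, not part of Tika itself.
class TikaServerClient {
    // Assumption: a Tika server is listening locally on its default port.
    static final String DEFAULT_ENDPOINT = "http://localhost:9998/tika";

    // Builds the PUT request that carries the raw document bytes.
    // "Accept: text/plain" asks the server for extracted plain text.
    static HttpRequest buildExtractRequest(String endpoint, byte[] document) {
        return HttpRequest.newBuilder(URI.create(endpoint))
                .header("Accept", "text/plain")
                .PUT(HttpRequest.BodyPublishers.ofByteArray(document))
                .build();
    }

    // Sends the file to the server and returns the response body as text.
    static String extractText(String endpoint, Path file) throws Exception {
        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = buildExtractRequest(endpoint, Files.readAllBytes(file));
        HttpResponse<String> response =
                client.send(request, HttpResponse.BodyHandlers.ofString());
        if (response.statusCode() != 200) {
            throw new IllegalStateException("Tika server returned HTTP " + response.statusCode());
        }
        return response.body();
    }

    public static void main(String[] args) throws Exception {
        if (args.length == 0) {
            System.out.println("usage: java TikaServerClient <file>");
            return;
        }
        System.out.println(extractText(DEFAULT_ENDPOINT, Path.of(args[0])));
    }
}
```

Because the exchange is plain HTTP, the same request can be issued from any language with an HTTP client, which is what makes the server convenient for non-Java programs.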

1283 questions
0
votes
1 answer

Read the metadata of a 6 MB XLSM file (Apache POI 3.9)

I need to read XLSM file metadata. For files smaller than 4 MB, the following instructions work correctly: try { OPCPackage pkg = OPCPackage.open (new FileInputStream ("C:\\Path to file.xlsm")); XSSFWorkbook XSSFWorkbook = new document…
user2671914
  • 45
  • 2
  • 9
0
votes
1 answer

Set the "Last Modified By" field in Office DOCX file metadata (Apache POI 3.9)

With POIXMLProperties.getCoreProperties() and POIXMLProperties.getExtendedProperties() I can set all the metadata values except "Last Modified By". Is there any way to set it? Thanks in advance.
user2671914
  • 45
  • 2
  • 9
0
votes
0 answers

How to read raw text from a PDF file using Java

I am using the PDFBox parser to read data from a PDF file in Java. It reads all the content from the PDF file. Below is my sample code to read data from a PDF file and store it in a text file. Sample Code: public class PDFTextParser { PDFParser…
user2664353
  • 127
  • 1
  • 2
  • 5
0
votes
1 answer

Nutch ERROR tika.TikaParser on Eclipse

I am running Nutch 2.2.1 on Eclipse Juno SR1 and JRE 1.7.0_25. The PARSE step is failing with this error: 2013-08-15 19:35:26,555 ERROR tika.TikaParser - Can't retrieve Tika parser for mime-type application/pdf 2013-08-15 19:35:26,557 WARN …
Osy
  • 1,613
  • 5
  • 21
  • 35
0
votes
1 answer

Solr Tika not storing any data

I am faced with a peculiar problem. I configured my data config and schema as described in the Solr wiki (Tika DIH). The data config looks like this:
Varun Jain
  • 1,901
  • 7
  • 33
  • 66
0
votes
1 answer

Apache Tika MP4 metadata keys

I'm using Apache Tika v 1.4 to parse video files in the following way. Metadata metadata = new Metadata(); String content = new Tika().parseToString(file.getInputStream(), metadata); metadata.get(KEY) The problem is that I don't know which keys to…
0
votes
2 answers

Best Tika integration on Solr or Nutch

Which is the best integration for Apache Tika, assuming that I have already connected and used Nutch (2.2.1) + Solr (4.3)? I understand that Tika can be integrated within Nutch and/or Solr, but which one is the better choice?
Osy
  • 1,613
  • 5
  • 21
  • 35
0
votes
1 answer

What is the right way to add the apache-tika dependency to a Grails project

When using tika-1.4 getting this: Caused by: java.lang.NoClassDefFoundError: net/sf/cglib/core/DebuggingClassWriter at net.sf.cglib.core.DefaultGeneratorStrategy.getClassWriter(DefaultGeneratorStrategy.java:30) at…
Archer
  • 5,073
  • 8
  • 50
  • 96
0
votes
1 answer

After indexing a file, how to extract file properties such as file type, name, etc. with Elasticsearch

I have indexed a document and I am able to search its content. But I also want to find the properties of the indexed file: the type of the document, its author, its name, and its size. How can this be achieved with the help…
Lav
  • 1,017
  • 2
  • 11
  • 16
0
votes
1 answer

How to read files with special characters in Python

I have crawled PDF, HTML, and DOC files using Apache Tika and stored the structured text in text files. These text files contain some unusual special characters, and because of these special characters I am unable to read the text files. I have below code…
user2609542
  • 801
  • 4
  • 13
  • 20
0
votes
0 answers

Can Apache Tika Extract Attachments?

I am using Apache Tika to extract text from various document formats. I would like to extract images from those files as well (usually PDF or Word). I am using TikaCLI as a proof of concept with the -z (--extract) option, but it never extracts any…
jriffel73
  • 128
  • 1
  • 9
0
votes
2 answers

Remove all special characters from file line except white space

I have extracted text from some PDF files using Tika and stored the text in text files. Now I want to parse these files using the OpenNLP chunk parser, but I was unable to parse the file lines because they contain some special characters (some square…
user2609542
  • 801
  • 4
  • 13
  • 20
0
votes
0 answers

How to extract keywords or tags from the file content

I have some files of different formats (HTML, PDF, DOC, EPUB). Using Apache Tika and Java I have extracted metadata and stored it in MongoDB. Now my aim is to extract keywords or tags from the file content and add them to one of the metadata fields, is…
user2522836
  • 93
  • 2
  • 9
0
votes
3 answers

Using Java Libraries With Incompatible Dependencies

I'm working on a project where I'd like to use Apache Tika and Apache Jena. However, when I try to run the project I get the following exception: java.lang.NoSuchMethodError: org.slf4j.spi.LocationAwareLogger.log…
James Baker
  • 1,143
  • 17
  • 39
0
votes
1 answer

Write output values to a JSON file using Java

Hi, below is my code to extract particular metadata tags and write those tags to a JSON file. I imported json.lib.jar and tika-app.jar into my build path. File dir = new File("C:/pdffiles"); File listDir[] = dir.listFiles(); for (int i = 0; i <…
user2353439
  • 489
  • 2
  • 7
  • 18