Questions tagged [apache-tika]

The Apache Tika™ toolkit detects and extracts metadata and structured text content from various documents using existing parser libraries.

Tika provides capabilities for identification of more than 1400 file types from the Internet Assigned Numbers Authority taxonomy of MIME types.

For most of the more common and popular formats, Tika then provides content extraction, metadata extraction and language identification capabilities.

While Tika is written in Java, it is widely used from other languages. The RESTful server and CLI Tool permit non-Java programs to access the Tika functionality.
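Most of the questions below exercise this Java API directly. As a rough illustration, a minimal sketch using the Tika facade class, assuming tika-core and tika-parsers are on the classpath; the file name is hypothetical:

    import java.io.File;
    import org.apache.tika.Tika;

    public class TikaQuickStart {
        public static void main(String[] args) throws Exception {
            Tika tika = new Tika();                        // facade over detection + parsing
            File file = new File("sample.pdf");            // hypothetical input file
            System.out.println(tika.detect(file));         // MIME type identification
            System.out.println(tika.parseToString(file));  // extracted text content
        }
    }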

1283 questions
0
votes
1 answer

How to get particular metadata tags from files using Apache Tika

I have some files in a folder (sample.pdf, sample.html, etc.) and I am using the following Apache Tika command to extract metadata: java -jar tika-app.jar -m -j /sample/sample.pdf > test.txt. After executing this command I am able to get all the metadata…
user2353439
  • 489
  • 2
  • 7
  • 18
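A sketch of how particular metadata keys could be read through the Java API instead of filtering the -m/-j output of tika-app; the file path comes from the question, the key names are examples only:

    import java.io.FileInputStream;
    import java.io.InputStream;
    import org.apache.tika.metadata.Metadata;
    import org.apache.tika.parser.AutoDetectParser;
    import org.apache.tika.parser.ParseContext;
    import org.apache.tika.sax.BodyContentHandler;

    public class SelectedMetadata {
        public static void main(String[] args) throws Exception {
            Metadata metadata = new Metadata();
            try (InputStream stream = new FileInputStream("/sample/sample.pdf")) {
                new AutoDetectParser().parse(stream, new BodyContentHandler(-1), metadata, new ParseContext());
            }
            // Print only the keys of interest, if the parser populated them.
            for (String name : new String[] {"Content-Type", "Author", "title"}) {
                System.out.println(name + " = " + metadata.get(name));
            }
        }
    }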
0
votes
0 answers

Configure Solr Index With File Metadata using TikaEntityProcessor & FieldStreamDataSource

I have created an index that pulls data from a SQL Server database using TikaEntityProcessor. The query associated with my configuration file pulls from a table containing file information, as well as the file content as a binary column. My index…
Nathan Hall
  • 409
  • 2
  • 8
  • 17
0
votes
1 answer

Hibernate Search ErrorHandler: Continue Indexing

I'm using the MassIndexer to index my domain model for a project I'm working on; my domain model includes file bytes stored in the database. I've properly annotated my domain model with the TikaBridge annotation for the collections of files inside…
user1170235
  • 189
  • 3
  • 9
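For context, a sketch of the TikaBridge mapping the question refers to, assuming Hibernate Search 4.x annotations; the entity and field names are hypothetical:

    import javax.persistence.Entity;
    import javax.persistence.Id;
    import org.hibernate.search.annotations.Field;
    import org.hibernate.search.annotations.Indexed;
    import org.hibernate.search.annotations.TikaBridge;

    @Entity
    @Indexed
    public class Attachment {
        @Id
        private Long id;

        // Tika extracts searchable text from the raw bytes when the MassIndexer runs.
        @Field
        @TikaBridge
        private byte[] content;
    }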
0
votes
1 answer

SOLR 4.1 Language Detection

I'm trying to use the LangDetectLanguageIdentifierUpdateProcessorFactory that comes with Solr to detect languages when indexing documents. It looks like a pretty straightforward implementation; I have put the following into…
rusho1234
  • 241
  • 2
  • 12
0
votes
2 answers

How to switch indexing off/on in a web page

I'm using Nutch 1.6 and Solr 4.3 on Ubuntu Server 12.04. I would like to switch content indexing on and off. Is there a way to specify this behaviour in my HTML pages so that Solr can behave accordingly? As an example, when using Google Search…
MarioCannistra
  • 275
  • 3
  • 12
0
votes
0 answers

Cannot parse PDF using Tika 1.3 (+ Lucene 4.2)

I'm trying to parse a PDF file and get its metadata and text, but I still don't get the wanted results. I am sure it is a silly mistake, but I can't see it. The file d.pdf exists and is located in the project's root folder. The imports are also…
yeaaaahhhh..hamf hamf
  • 746
  • 2
  • 13
  • 34
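A minimal, self-contained way to parse a PDF with the Tika 1.x Java API and print its metadata and text, assuming tika-parsers (which pulls in PDFBox) is on the classpath; the file name d.pdf is taken from the question, the rest is a sketch:

    import java.io.FileInputStream;
    import java.io.InputStream;
    import org.apache.tika.metadata.Metadata;
    import org.apache.tika.parser.AutoDetectParser;
    import org.apache.tika.parser.ParseContext;
    import org.apache.tika.sax.BodyContentHandler;

    public class ParsePdf {
        public static void main(String[] args) throws Exception {
            BodyContentHandler handler = new BodyContentHandler(-1);   // -1 disables the write limit
            Metadata metadata = new Metadata();
            try (InputStream stream = new FileInputStream("d.pdf")) {
                new AutoDetectParser().parse(stream, handler, metadata, new ParseContext());
            }
            for (String name : metadata.names()) {
                System.out.println(name + " = " + metadata.get(name));
            }
            System.out.println(handler.toString());   // plain-text body of the PDF
        }
    }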
0
votes
1 answer

Running Java Application using Popen

I am running Tika on my Linux server, and I want to run it using Python (subprocess.Popen). However, I have non-root access, so I only have a local Java installation. Every time, I need to set the Java home and path for each session: export…
hmghaly
  • 1,411
  • 3
  • 29
  • 47
0
votes
1 answer

Apache Tika, reading parsed body in MailContentHandler

The source code of the MailContentHandler has this: try { BodyContentHandler bch = new BodyContentHandler(handler); parser.parse(is, new EmbeddedContentHandler(bch), submd, context); I would like to read the body content as…
Chris
  • 923
  • 1
  • 8
  • 11
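One possible approach, sketched here rather than prescribed: instead of modifying MailContentHandler, hand Tika's RFC822 parser a BodyContentHandler and read the collected body text afterwards; the .eml path is hypothetical:

    import java.io.FileInputStream;
    import java.io.InputStream;
    import org.apache.tika.metadata.Metadata;
    import org.apache.tika.parser.ParseContext;
    import org.apache.tika.parser.mail.RFC822Parser;
    import org.apache.tika.sax.BodyContentHandler;

    public class MailBodyText {
        public static void main(String[] args) throws Exception {
            BodyContentHandler body = new BodyContentHandler(-1);
            Metadata metadata = new Metadata();
            try (InputStream stream = new FileInputStream("message.eml")) {
                new RFC822Parser().parse(stream, body, metadata, new ParseContext());
            }
            System.out.println(body.toString());   // body text collected by the handler
        }
    }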
0
votes
2 answers

MS Word to XML/HTML using Apache Tika

I came across Tika, which is very useful for text extraction from Word: curl www.vit.org/downloads/doc/tariff.doc \ | java -jar tika-app-1.3.jar --text But is there a way to use it to convert the MS Word file into XML/HTML?
hmghaly
  • 1,411
  • 3
  • 29
  • 47
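Tika can emit structured markup as well as plain text. A sketch using ToXMLContentHandler to get XHTML from a Word document through the Java API (the tika-app command line also has --xml and --html output options); the file name is taken from the question:

    import java.io.FileInputStream;
    import java.io.InputStream;
    import org.apache.tika.metadata.Metadata;
    import org.apache.tika.parser.AutoDetectParser;
    import org.apache.tika.parser.ParseContext;
    import org.apache.tika.sax.ToXMLContentHandler;

    public class WordToXhtml {
        public static void main(String[] args) throws Exception {
            ToXMLContentHandler handler = new ToXMLContentHandler();
            try (InputStream stream = new FileInputStream("tariff.doc")) {
                new AutoDetectParser().parse(stream, handler, new Metadata(), new ParseContext());
            }
            System.out.println(handler.toString());   // XHTML rendering of the document
        }
    }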
0
votes
1 answer

Convert Word and Excel to HTML on Android

I would like to convert Word and Excel documents to HTML to show them in the browser in my Android app. I found the Apache POI library, but it converts practically only text, without objects like forms, diagrams, WordArt, etc. Or is it possible and I…
Bartosz Bialecki
  • 4,391
  • 10
  • 42
  • 64
0
votes
0 answers

Language detection from scanned PDF documents

I am trying to find the language of a PDF document and categorize it. The major problem I face is that the document is a scanned PDF, so there is no clue from fonts or Unicode and Apache Tika doesn't help much here. I tried using Tesseract to…
karthikselva
  • 123
  • 8
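A sketch of the second half of that pipeline: once OCR (e.g. Tesseract) has produced text, Tika 1.x's LanguageIdentifier can categorize it; the ocrText value is a placeholder:

    import org.apache.tika.language.LanguageIdentifier;

    public class DetectLanguage {
        public static void main(String[] args) {
            String ocrText = "...text produced by the OCR step...";   // placeholder input
            LanguageIdentifier identifier = new LanguageIdentifier(ocrText);
            System.out.println(identifier.getLanguage()               // ISO 639 code, e.g. "en"
                    + " (reasonably certain: " + identifier.isReasonablyCertain() + ")");
        }
    }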
0
votes
0 answers

Apache Tika OSGi bundle fail when doing mvn install

I am trying to install the Apache Tika toolkit on Windows Server 2008 R2. I go to the folder where Apache Tika is and execute mvn install from the command line. Yet, that produces a failure that says: [ERROR] Failed to execute goal…
fulupr
  • 129
  • 2
  • 11
0
votes
2 answers

Error in configuring object when converting into Tika using Behemoth and MapReduce

I am running the command to convert a Behemoth corpus to Tika using MapReduce, as given in this tutorial. I am getting the following error when doing it: 13/02/25 14:44:00 INFO mapred.FileInputFormat: Total input paths to process : 1 13/02/25 14:44:01…
Shrey Shivam
  • 1,107
  • 1
  • 7
  • 16
0
votes
1 answer

Where does Apache Tika obtain its "counts" from?

If I have the following code to read the number of paragraphs (Office.PARAGRAPH_COUNT) from a PDF: TikaInputStream pdfStream = TikaInputStream.get(new File("some-doc.pdf")); ContentHandler handler = new DefaultContentHandler(); Metadata pdfMeta =…
user1768830
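One way to see which count-style properties a given parser actually reports is to dump every metadata entry after parsing; a sketch, with the file name taken from the question:

    import java.io.File;
    import org.apache.tika.io.TikaInputStream;
    import org.apache.tika.metadata.Metadata;
    import org.apache.tika.parser.AutoDetectParser;
    import org.apache.tika.parser.ParseContext;
    import org.apache.tika.sax.BodyContentHandler;

    public class DumpMetadata {
        public static void main(String[] args) throws Exception {
            Metadata metadata = new Metadata();
            try (TikaInputStream stream = TikaInputStream.get(new File("some-doc.pdf"))) {
                new AutoDetectParser().parse(stream, new BodyContentHandler(-1), metadata, new ParseContext());
            }
            for (String name : metadata.names()) {
                System.out.println(name + " = " + metadata.get(name));   // shows any paragraph/word counts present
            }
        }
    }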
0
votes
2 answers

Extract contents of files from a remote FTP server without writing to a file on the local disk

After establishing a connection to a remote FTP or SFTP server programmatically using Java, is it possible to read the files in /home/www-data/content/ without writing to a file on the local system? Basically, I want to extract metadata of files using…
user850234
  • 3,373
  • 15
  • 49
  • 83
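A sketch of streaming a remote file straight into Tika without touching the local disk, assuming Apache Commons Net for the FTP side; the host, credentials, and the file name inside /home/www-data/content/ are placeholders:

    import java.io.InputStream;
    import org.apache.commons.net.ftp.FTP;
    import org.apache.commons.net.ftp.FTPClient;
    import org.apache.tika.metadata.Metadata;
    import org.apache.tika.parser.AutoDetectParser;
    import org.apache.tika.parser.ParseContext;
    import org.apache.tika.sax.BodyContentHandler;

    public class RemoteMetadata {
        public static void main(String[] args) throws Exception {
            FTPClient ftp = new FTPClient();
            ftp.connect("ftp.example.com");            // placeholder host
            ftp.login("user", "password");             // placeholder credentials
            ftp.enterLocalPassiveMode();
            ftp.setFileType(FTP.BINARY_FILE_TYPE);

            Metadata metadata = new Metadata();
            try (InputStream stream = ftp.retrieveFileStream("/home/www-data/content/report.pdf")) {
                new AutoDetectParser().parse(stream, new BodyContentHandler(-1), metadata, new ParseContext());
            }
            ftp.completePendingCommand();              // finalize the transfer before reusing the connection
            ftp.logout();
            ftp.disconnect();

            for (String name : metadata.names()) {
                System.out.println(name + " = " + metadata.get(name));
            }
        }
    }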