Questions tagged [apache-tika]

The Apache Tika™ toolkit detects and extracts metadata and structured text content from various documents using existing parser libraries.

Tika provides capabilities for identification of more than 1400 file types from the Internet Assigned Numbers Authority taxonomy of MIME types.

For most of the more common and popular formats, Tika then provides content extraction, metadata extraction and language identification capabilities.

While Tika is written in Java, it is widely used from other languages. The RESTful server and CLI Tool permit non-Java programs to access the Tika functionality.
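Most of the questions below exercise this Java API directly. As a rough illustration, a minimal sketch using the Tika facade class, assuming tika-core and tika-parsers are on the classpath; the file name is hypothetical:

    import java.io.File;
    import org.apache.tika.Tika;

    public class TikaQuickStart {
        public static void main(String[] args) throws Exception {
            Tika tika = new Tika();                        // facade over detection + parsing
            File file = new File("sample.pdf");            // hypothetical input file
            System.out.println(tika.detect(file));         // MIME type identification
            System.out.println(tika.parseToString(file));  // extracted text content
        }
    }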

1283 questions
0
votes
1 answer

How to get particular metadata tags from files using Apache Tika

I have some files in a folder (sample.pdf, sample.html, etc.) and I am using the following Apache Tika command to extract metadata: java -jar tika-app.jar -m -j /sample/sample.pdf > test.txt. After executing this command I am able to get all the metadata…
user2353439
  • 489
  • 2
  • 7
  • 18
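A sketch of how particular metadata keys could be read through the Java API instead of filtering the -m/-j output of tika-app; the file path comes from the question, the key names are examples only:

    import java.io.FileInputStream;
    import java.io.InputStream;
    import org.apache.tika.metadata.Metadata;
    import org.apache.tika.parser.AutoDetectParser;
    import org.apache.tika.parser.ParseContext;
    import org.apache.tika.sax.BodyContentHandler;

    public class SelectedMetadata {
        public static void main(String[] args) throws Exception {
            Metadata metadata = new Metadata();
            try (InputStream stream = new FileInputStream("/sample/sample.pdf")) {
                new AutoDetectParser().parse(stream, new BodyContentHandler(-1), metadata, new ParseContext());
            }
            // Print only the keys of interest, if the parser populated them.
            for (String name : new String[] {"Content-Type", "Author", "title"}) {
                System.out.println(name + " = " + metadata.get(name));
            }
        }
    }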
0
votes
0 answers

Configure Solr Index With File Metadata using TikaEntityProcessor & FieldStreamDataSource

I have created an index that pulls data from a SQL Server database using TikaEntityProcessor. The query associated with my configuration file pulls from a table containing file information, as well as the file content as a binary column. My index…
Nathan Hall
  • 409
  • 2
  • 8
  • 17
0
votes
1 answer

Hibernate Search ErrorHandler: Continue Indexing

I'm using the MassIndexer to index my domain model for a project I'm working on; my domain model includes file bytes stored in the database. I've properly annotated my domain model with the TikaBridge annotation for the collections of files inside…
user1170235
  • 189
  • 3
  • 9
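For context, a sketch of the TikaBridge mapping the question refers to, assuming Hibernate Search 4.x annotations; the entity and field names are hypothetical:

    import javax.persistence.Entity;
    import javax.persistence.Id;
    import org.hibernate.search.annotations.Field;
    import org.hibernate.search.annotations.Indexed;
    import org.hibernate.search.annotations.TikaBridge;

    @Entity
    @Indexed
    public class Attachment {
        @Id
        private Long id;

        // Tika extracts searchable text from the raw bytes when the MassIndexer runs.
        @Field
        @TikaBridge
        private byte[] content;
    }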
0
votes
1 answer

SOLR 4.1 Language Detection

I'm trying to use the LangDetectLanguageIdentifierUpdateProcessorFactory that comes with Solr to detect languages when indexing documents. It looks like a pretty straightforward implementation; I have put the following into…
rusho1234
  • 241
  • 2
  • 12
0
votes
2 answers

How to switch indexing off/on in a web page

I'm using Nutch 1.6 and Solr 4.3 on Ubuntu Server 12.04. I would like to switch content indexing on and off. Is there a way to specify this behaviour in my HTML pages so that Solr can behave accordingly? As an example, when using Google Search…
MarioCannistra
  • 275
  • 3
  • 12
0
votes
0 answers

Cannot parse PDF using Tika 1.3 (+ Lucene 4.2)

I'm trying to parse a PDF file and get its metadata and text, but I still don't get the wanted results. I am sure it is a silly mistake, but I can't see it. The file d.pdf exists and is located in the project's root folder. The imports are also…
yeaaaahhhh..hamf hamf
  • 746
  • 2
  • 13
  • 34
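A minimal, self-contained way to parse a PDF with the Tika 1.x Java API and print its metadata and text, assuming tika-parsers (which pulls in PDFBox) is on the classpath; the file name d.pdf is taken from the question, the rest is a sketch:

    import java.io.FileInputStream;
    import java.io.InputStream;
    import org.apache.tika.metadata.Metadata;
    import org.apache.tika.parser.AutoDetectParser;
    import org.apache.tika.parser.ParseContext;
    import org.apache.tika.sax.BodyContentHandler;

    public class ParsePdf {
        public static void main(String[] args) throws Exception {
            BodyContentHandler handler = new BodyContentHandler(-1);   // -1 disables the write limit
            Metadata metadata = new Metadata();
            try (InputStream stream = new FileInputStream("d.pdf")) {
                new AutoDetectParser().parse(stream, handler, metadata, new ParseContext());
            }
            for (String name : metadata.names()) {
                System.out.println(name + " = " + metadata.get(name));
            }
            System.out.println(handler.toString());   // plain-text body of the PDF
        }
    }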
0
votes
1 answer

Running Java Application using Popen

I am running Tika on my Linux server, and I want to run it using Python (subprocess.Popen). However, I have non-root access, so I only have a local Java installation. Every time, I need to set the Java home and path for each session: export…
hmghaly
  • 1,411
  • 3
  • 29
  • 47
0
votes
1 answer

Apache Tika, reading parsed body in MailContentHandler

The source code of the MailContentHandler has this: try { BodyContentHandler bch = new BodyContentHandler(handler); parser.parse(is, new EmbeddedContentHandler(bch), submd, context); I would like to read the body content as…
Chris
  • 923
  • 1
  • 8
  • 11
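One possible approach, sketched here rather than prescribed: instead of modifying MailContentHandler, hand Tika's RFC822 parser a BodyContentHandler and read the collected body text afterwards; the .eml path is hypothetical:

    import java.io.FileInputStream;
    import java.io.InputStream;
    import org.apache.tika.metadata.Metadata;
    import org.apache.tika.parser.ParseContext;
    import org.apache.tika.parser.mail.RFC822Parser;
    import org.apache.tika.sax.BodyContentHandler;

    public class MailBodyText {
        public static void main(String[] args) throws Exception {
            BodyContentHandler body = new BodyContentHandler(-1);
            Metadata metadata = new Metadata();
            try (InputStream stream = new FileInputStream("message.eml")) {
                new RFC822Parser().parse(stream, body, metadata, new ParseContext());
            }
            System.out.println(body.toString());   // body text collected by the handler
        }
    }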
0
votes
2 answers

MS Word to XML/HTML using Apache Tika

I came across Tika, which is very useful for text extraction from Word: curl www.vit.org/downloads/doc/tariff.doc \ | java -jar tika-app-1.3.jar --text But is there a way to use it to convert the MS Word file into XML/HTML?
hmghaly
  • 1,411
  • 3
  • 29
  • 47
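Tika can emit structured markup as well as plain text. A sketch using ToXMLContentHandler to get XHTML from a Word document through the Java API (the tika-app command line also has --xml and --html output options); the file name is taken from the question:

    import java.io.FileInputStream;
    import java.io.InputStream;
    import org.apache.tika.metadata.Metadata;
    import org.apache.tika.parser.AutoDetectParser;
    import org.apache.tika.parser.ParseContext;
    import org.apache.tika.sax.ToXMLContentHandler;

    public class WordToXhtml {
        public static void main(String[] args) throws Exception {
            ToXMLContentHandler handler = new ToXMLContentHandler();
            try (InputStream stream = new FileInputStream("tariff.doc")) {
                new AutoDetectParser().parse(stream, handler, new Metadata(), new ParseContext());
            }
            System.out.println(handler.toString());   // XHTML rendering of the document
        }
    }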
0
votes
1 answer

Convert Word and Excel to HTML on Android

I would like to convert Word and Excel documents to HTML to show them in the browser in my Android app. I found the Apache POI library, but it converts practically only text, without objects like forms, diagrams, WordArt, etc. Or is it possible and I…
Bartosz Bialecki
  • 4,391
  • 10
  • 42
  • 64
0
votes
0 answers

Language detection from scanned PDF documents

I am trying to find the language of a PDF document and categorize it. The major problem I face is that the document is a scanned PDF, so there is no clue from fonts or Unicode and Apache Tika doesn't help much here. I tried using Tesseract to…
karthikselva
  • 123
  • 8
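A sketch of the second half of that pipeline: once OCR (e.g. Tesseract) has produced text, Tika 1.x's LanguageIdentifier can categorize it; the ocrText value is a placeholder:

    import org.apache.tika.language.LanguageIdentifier;

    public class DetectLanguage {
        public static void main(String[] args) {
            String ocrText = "...text produced by the OCR step...";   // placeholder input
            LanguageIdentifier identifier = new LanguageIdentifier(ocrText);
            System.out.println(identifier.getLanguage()               // ISO 639 code, e.g. "en"
                    + " (reasonably certain: " + identifier.isReasonablyCertain() + ")");
        }
    }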
0
votes
0 answers

Apache Tika OSGi bundle fail when doing mvn install

I am trying to install the Apache Tika toolkit on Windows Server 2008 R2. I go to the folder where Apache Tika is and execute mvn install from the command line. Yet, that produces a failure that says: [ERROR] Failed to execute goal…
fulupr
  • 129
  • 2
  • 11
0
votes
2 answers

Error in configuring object when converting into Tika using Behemoth and MapReduce

I am running the command to convert a Behemoth corpus to Tika using MapReduce, as given in this tutorial. I am getting the following error when doing it: 13/02/25 14:44:00 INFO mapred.FileInputFormat: Total input paths to process : 1 13/02/25 14:44:01…
Shrey Shivam
  • 1,107
  • 1
  • 7
  • 16
0
votes
1 answer

Where does Apache Tika obtain its "counts" from?

If I have the following code to read the number of paragraphs (Office.PARAGRAPH_COUNT) from a PDF: TikaInputStream pdfStream = TikaInputStream.get(new File("some-doc.pdf")); ContentHandler handler = new DefaultContentHandler(); Metadata pdfMeta =…
user1768830
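One way to see which count-style properties a given parser actually reports is to dump every metadata entry after parsing; a sketch, with the file name taken from the question:

    import java.io.File;
    import org.apache.tika.io.TikaInputStream;
    import org.apache.tika.metadata.Metadata;
    import org.apache.tika.parser.AutoDetectParser;
    import org.apache.tika.parser.ParseContext;
    import org.apache.tika.sax.BodyContentHandler;

    public class DumpMetadata {
        public static void main(String[] args) throws Exception {
            Metadata metadata = new Metadata();
            try (TikaInputStream stream = TikaInputStream.get(new File("some-doc.pdf"))) {
                new AutoDetectParser().parse(stream, new BodyContentHandler(-1), metadata, new ParseContext());
            }
            for (String name : metadata.names()) {
                System.out.println(name + " = " + metadata.get(name));   // shows any paragraph/word counts present
            }
        }
    }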
0
votes
2 answers

Extract contents of files from a remote FTP server without writing to a file on the local disk

After establishing a connection to a remote FTP or SFTP server programmatically using Java, is it possible to read the files in /home/www-data/content/ without writing to a file on the local system? Basically, I want to extract metadata of files using…
user850234
  • 3,373
  • 15
  • 49
  • 83
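A sketch of streaming a remote file straight into Tika without touching the local disk, assuming Apache Commons Net for the FTP side; the host, credentials, and the file name inside /home/www-data/content/ are placeholders:

    import java.io.InputStream;
    import org.apache.commons.net.ftp.FTP;
    import org.apache.commons.net.ftp.FTPClient;
    import org.apache.tika.metadata.Metadata;
    import org.apache.tika.parser.AutoDetectParser;
    import org.apache.tika.parser.ParseContext;
    import org.apache.tika.sax.BodyContentHandler;

    public class RemoteMetadata {
        public static void main(String[] args) throws Exception {
            FTPClient ftp = new FTPClient();
            ftp.connect("ftp.example.com");            // placeholder host
            ftp.login("user", "password");             // placeholder credentials
            ftp.enterLocalPassiveMode();
            ftp.setFileType(FTP.BINARY_FILE_TYPE);

            Metadata metadata = new Metadata();
            try (InputStream stream = ftp.retrieveFileStream("/home/www-data/content/report.pdf")) {
                new AutoDetectParser().parse(stream, new BodyContentHandler(-1), metadata, new ParseContext());
            }
            ftp.completePendingCommand();              // finalize the transfer before reusing the connection
            ftp.logout();
            ftp.disconnect();

            for (String name : metadata.names()) {
                System.out.println(name + " = " + metadata.get(name));
            }
        }
    }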