Questions tagged [apache-tika]

The Apache Tika™ toolkit detects and extracts metadata and structured text content from various documents using existing parser libraries.

Tika provides capabilities for identification of more than 1400 file types from the Internet Assigned Numbers Authority taxonomy of MIME types.

For most of the more common and popular formats, Tika then provides content extraction, metadata extraction and language identification capabilities.

While Tika is written in Java, it is widely used from other languages. The RESTful server and CLI Tool permit non-Java programs to access the Tika functionality.

tika-framework

1283 questions
0
votes
1 answer

Configuring Apache Tika

This documentation section states that Apache Tika can be configured using a dedicated configuration file: https://cwiki.apache.org/confluence/display/solr/Uploading+Data+with+Solr+Cell+using+Apache+Tika
illegal-immigrant
  • 8,089
  • 9
  • 51
  • 84
0
votes
3 answers

Exclude menu from content extraction with tika

I generate HTML documents that contain a menu and a content part. I then want to extract the content of these documents to feed it to a Lucene index. However, I would like to exclude the menu from the content extraction and thus only index the…
bertolami
  • 2,896
  • 2
  • 24
  • 41
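
One way to approach the menu question above is to filter at the SAX level: Tika hands the parsed document to an org.xml.sax.ContentHandler, so a handler can ignore all text inside one subtree. The sketch below uses only JDK classes and assumes the menu is wrapped in an element with id="menu" (a hypothetical id; the question does not say how the menu is marked up). The class name SkipElementTextHandler is also made up for illustration.

```java
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

// Collects character data, except for text nested inside the element
// whose "id" attribute equals skipId (hypothetical marker for the menu).
final class SkipElementTextHandler extends DefaultHandler {
    private final String skipId;
    private final StringBuilder text = new StringBuilder();
    private int skipDepth = 0; // > 0 while inside the skipped subtree

    SkipElementTextHandler(String skipId) {
        this.skipId = skipId;
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        // Enter the skipped subtree, or go one level deeper inside it.
        if (skipDepth > 0 || skipId.equals(atts.getValue("id"))) {
            skipDepth++;
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (skipDepth > 0) {
            skipDepth--;
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        if (skipDepth == 0) {
            text.append(ch, start, length);
        }
    }

    String text() {
        return text.toString().trim();
    }
}
```

With Tika, such a handler would be passed as the ContentHandler argument of Parser.parse(InputStream, ContentHandler, Metadata, ParseContext). Whether the id attribute actually reaches the handler depends on the HTML mapper in use, so treat that part as an assumption to verify.
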
0
votes
1 answer

Process zip file into Solr

I have to process a zip file that contains multiple zip files, and these zip files contain XML and image files. I have to index the data into Solr, which should return the content of the XML data as the result. I tried the default Solr-Tika example, which returns only zip…
user2551549
  • 192
  • 2
  • 12
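
For the nested zip question above, one option is to unpack the archive on the client side before posting to Solr. The sketch below uses only java.util.zip and lists every non-zip file, descending into inner .zip entries. The class name NestedZipWalker and the "!/" path separator are illustrative choices, not part of Tika or Solr.

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

final class NestedZipWalker {

    /** Returns the paths of all non-zip files, descending into nested .zip entries. */
    static List<String> listLeafEntries(InputStream in) throws IOException {
        List<String> found = new ArrayList<>();
        walk(new ZipInputStream(in), "", found);
        return found;
    }

    private static void walk(ZipInputStream zip, String prefix, List<String> found)
            throws IOException {
        ZipEntry entry;
        while ((entry = zip.getNextEntry()) != null) {
            if (entry.isDirectory()) {
                continue;
            }
            String path = prefix + entry.getName();
            if (entry.getName().toLowerCase().endsWith(".zip")) {
                // Read the current entry's bytes as another zip stream.
                // The inner stream is not closed here, because closing it
                // would also close the outer stream.
                walk(new ZipInputStream(zip), path + "!/", found);
            } else {
                found.add(path);
            }
        }
    }
}
```

Instead of collecting names, the else branch could hand each XML entry's bytes to whatever indexing step is in use; that part is left out because the question does not show it.
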
0
votes
1 answer

MimeType via Java Apache tika

I have a problem with file type detection. On the development server and on the production servers, Apache Tika detects all kinds of files. But on the test server, most of the time I get 'application/octet-stream'. public static String detectMimeType(final File file)…
Oleksandr Samsonov
  • 1,067
  • 2
  • 14
  • 29
0
votes
3 answers

Set last_modified field when not defined in document in Solr

I'm using the SimplePostTool from the Solr 4.6 example to import documents from the filesystem into Solr. Everything works, but the last_modified field is filled only when the original document has metadata for it. If the field is not present, the Solr extractor leaves…
Javier Alvarez
  • 1,409
  • 2
  • 15
  • 33
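
For the last_modified question above, one possible workaround is to compute a fallback value on the client before posting. This is not what SimplePostTool does itself. The sketch below prefers a timestamp taken from document metadata and otherwise falls back to the file's modification time on disk. The class and method names are hypothetical, and the metadata value is assumed to be an ISO-8601 UTC string such as 2014-01-02T03:04:05Z.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.time.Instant;
import java.time.format.DateTimeParseException;

final class LastModifiedFallback {

    /**
     * Returns the metadata timestamp when it is present and parseable,
     * otherwise the file system's last-modified time for the file.
     */
    static Instant resolve(String metadataValue, Path file) throws IOException {
        if (metadataValue != null && !metadataValue.isBlank()) {
            try {
                return Instant.parse(metadataValue.trim());
            } catch (DateTimeParseException ignored) {
                // Unparseable metadata: fall through to the file timestamp.
            }
        }
        return Files.getLastModifiedTime(file).toInstant();
    }
}
```

The resulting Instant prints in the date format Solr accepts, so it could be supplied with the request, for example as a literal.last_modified parameter of the extracting request handler. Whether that fits the setup in the question is an assumption.
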
0
votes
1 answer

solr delta import not working with TikaEntityProcessor

I am trying to schedule a delta-import with TikaEntityProcessor. The full import works fine, but the delta-import does not update anything. There is no error either. This much of the server log gets displayed; I am not able to figure out what went…
0
votes
1 answer

Can I use Tika for content extraction on Google App Engine?

I'm working on a web service that requires content extraction from uploaded documents (PDF, PowerPoint, Word, etc), and I'd like to host this on Google App Engine to keep costs low. Short of running a Google Compute VM to run Solr/Tika as a server,…
Carson
  • 17,073
  • 19
  • 66
  • 87
0
votes
3 answers

Java web service only responds on localhost, not by hostname (Apache Tika)

It's easier to show than to tell. This is from the Apache Tika web service: http://pastebin.com/jrCsVVtt On line 89 of that file, localhost is hard-coded: sf.setProviders(providers); sf.setAddress("http://localhost:" + TikaServerCli.DEFAULT_PORT +…
rianjs
  • 7,767
  • 5
  • 24
  • 40
0
votes
0 answers

solr reindexing only modified documents

I am using the Solr DataImportHandler with Tika to search rich documents such as Word and PDF files. Whenever a new file is added or an existing file is changed, I have to do a full import to include the changes in the search. As the number of…
0
votes
1 answer

Parsing HTML elements in Apache Tika

How can I retrieve the value of this element with id 48227783 using Apache Tika?
Ownage!
I am trying to retrieve the value 'Ownage!'. I tried to use mapSafeElement, DefaultHtmlMapper objects seems…
akunyer
  • 107
  • 11
0
votes
1 answer

How to write custom ContentHandler using Apache Tika?

I want to extract text which is inside some tags like
,
, etc. from HTML files using Apache Tika. So I am writing a custom ContentHandler that is supposed to extract information from these tags. My custom ContentHandler code looks like…
Shekhar
  • 11,438
  • 36
  • 130
  • 186
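
As a rough illustration of the custom ContentHandler idea above, the sketch below extends the JDK's org.xml.sax.helpers.DefaultHandler and collects the text of selected elements, including text in nested child elements. The tag names in the question are not visible, so the set of wanted names is a constructor parameter. TagTextHandler is a hypothetical name, not a Tika class.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

// Gathers the text content of every element whose name is in "wanted".
final class TagTextHandler extends DefaultHandler {
    private final Set<String> wanted;
    private final List<String> texts = new ArrayList<>();
    private final StringBuilder current = new StringBuilder();
    private int depth = 0; // > 0 while inside a wanted element

    TagTextHandler(Set<String> wantedNames) {
        this.wanted = wantedNames;
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        // Start capturing at a wanted element, or track nesting inside one.
        if (depth > 0 || wanted.contains(name(localName, qName))) {
            depth++;
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        // When the outermost wanted element closes, store its text.
        if (depth > 0 && --depth == 0) {
            texts.add(current.toString().trim());
            current.setLength(0);
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        if (depth > 0) {
            current.append(ch, start, length);
        }
    }

    List<String> texts() {
        return texts;
    }

    // Parsers that are not namespace-aware report an empty localName.
    private static String name(String localName, String qName) {
        return localName.isEmpty() ? qName : localName;
    }
}
```

With Tika, an instance would be passed as the ContentHandler argument of Parser.parse(InputStream, ContentHandler, Metadata, ParseContext). Note that Tika's HTML parser may drop or rename elements according to its HTML mapper, so whether a given tag reaches the handler is something to check rather than assume.
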
0
votes
2 answers

Retrieving absolute URL from a webpage

I want to extract full links from an HTML file. By full links I mean absolute links. I used Tika for this purpose. Here is my code: URL url = new URL("http://www.domainname.com/"); InputStream input = url.openStream(); LinkContentHandler linkHandler = new…
Alex
  • 1,406
  • 2
  • 18
  • 33
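
For the absolute URL question above: if the extracted links come back as they appear in the page (relative hrefs included), they can be resolved against the page URL afterwards. The sketch below does that with java.net.URI from the JDK. AbsoluteLinks and resolveAll are hypothetical names, and the href strings are assumed to have already been pulled out of the link handler (for example via Link.getUri()).

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.List;

final class AbsoluteLinks {

    /**
     * Resolves each href against the URL of the page it was found on.
     * Already-absolute hrefs are returned unchanged. A malformed href
     * makes URI.resolve throw IllegalArgumentException.
     */
    static List<String> resolveAll(String pageUrl, List<String> hrefs) {
        URI base = URI.create(pageUrl);
        List<String> absolute = new ArrayList<>();
        for (String href : hrefs) {
            absolute.add(base.resolve(href.trim()).toString());
        }
        return absolute;
    }
}
```

The base URL should include a path (at least a trailing slash, as in http://www.domainname.com/), since resolving against a URL with an empty path is a known rough edge of java.net.URI.
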
0
votes
0 answers

Apache Tika does not render td tag for blank cell in Excel file

When I parse an Excel file that has an empty cell, Tika doesn't create an extra td tag. If there are three cells of which the middle one is empty, it skips the middle cell and only outputs the 1st and 3rd cells with a td tag. How can I tell Tika not to ignore the…
Krishna
  • 486
  • 8
  • 20
0
votes
1 answer

Apache Tika text extraction on Google App Engine

I need to extract text from a few document types (.doc .docx .pdf and .txt primarily) from email attachments. The application is running on Google App Engine. Apache Tika does exactly what I need it to, but I'm running into a SecurityException when it…
0
votes
0 answers

How to decode special characters with Apache Tika

I'm using Apache Tika to parse some MS Word documents to HTML (String). The problem is that some documents contain special characters (e.g. Mathematical Operators). Is there any way to solve it? Thank you for help. Input: Output Source…
Peter Jurkovic
  • 2,686
  • 6
  • 36
  • 55