Questions tagged [apache-tika]

The Apache Tika™ toolkit detects and extracts metadata and structured text content from various documents using existing parser libraries.

Tika provides capabilities for identification of more than 1400 file types from the Internet Assigned Numbers Authority taxonomy of MIME types.

For most of the more common and popular formats, Tika then provides content extraction, metadata extraction and language identification capabilities.

While Tika is written in Java, it is widely used from other languages. The RESTful server and CLI Tool permit non-Java programs to access the Tika functionality.

1283 questions

votes

1 answer

Configuring Apache Tika

This documentation section states that Apache Tika can be configured using a dedicated configuration file: https://cwiki.apache.org/confluence/display/solr/Uploading+Data+with+Solr+Cell+using+Apache+Tika

apache solr apache-tika

asked Jan 30 '14 at 09:04

illegal-immigrant

8,089
9
51
84

votes

3 answers

Exclude menu from content extraction with tika

I generate HTML documents that contain a menu and a content part. Then I want to extract the content of these documents to feed it to a Lucene index. However, I would like to exclude the menu from the content extraction and thus only index the…

lucene html-parsing apache-tika

asked Jan 15 '14 at 12:20

bertolami

2,896
2
24
41

votes

1 answer

Process zip file into Solr

I have to process a zip file which contains multiple zip files, and these zip files contain XML and image files. I have to index the data into Solr, which should return the content of the XML data as results. I tried the default Solr-Tika example, which returns only zip…

solr apache-tika

asked Jan 11 '14 at 21:04

user2551549

votes

1 answer

MimeType via Java Apache tika

I have a problem with file type detection. On the development server and on the production servers Apache Tika detects all kinds of files. But on the test server, most of the time I get: 'application/octet-stream' public static String detectMimeType(final File file)…

java mime-types apache-tika

asked Dec 27 '13 at 12:46

Oleksandr Samsonov

1,067
2
14
29

votes

3 answers

Set last_modified field when not defined in document in Solr

I'm using the Solr 4.6 example's SimplePostTool to import documents from the filesystem to Solr. Everything works, but the last_modified field is filled only when the original document has metadata for it. If the field is not present, the Solr extractor leaves…

apache solr lucene metadata apache-tika

asked Dec 23 '13 at 14:54

Javier Alvarez

1,409
2
15
33

votes

1 answer

solr delta import not working with TikaEntityProcessor

I am trying to schedule a delta-import with TikaEntityProcessor. The full import is working fine, but the delta-import is not updating anything. There is no error either. This is all that the server logs display; I am not able to figure out what went…

solr apache-tika

asked Dec 16 '13 at 08:09

souvik chakraborty

votes

1 answer

Can I use Tika for content extraction on Google App Engine?

I'm working on a web service that requires content extraction from uploaded documents (PDF, PowerPoint, Word, etc), and I'd like to host this on Google App Engine to keep costs low. Short of running a Google Compute VM to run Solr/Tika as a server,…

java google-app-engine solr apache-tika

asked Dec 03 '13 at 21:22

Carson

17,073
19
66
87

votes

3 answers

Java web service only responds on localhost, not by hostname (Apache Tika)

It's easier to show than to tell. This is from the Apache Tika web service: http://pastebin.com/jrCsVVtt On line 89 of that file, localhost is hard-coded: sf.setProviders(providers); sf.setAddress("http://localhost:" + TikaServerCli.DEFAULT_PORT +…

java web-services hostname apache-tika

asked Nov 15 '13 at 19:10

rianjs

7,767
5
24
40

votes

0 answers

solr reindexing only modified documents

I am using the Solr DataImportHandler with Tika to search rich documents such as Word and PDF files. Whenever a new file is added or an existing file is changed, I have to do a full import to include the changes in the search. As the number of…

solr apache-tika

asked Oct 27 '13 at 07:54

Susha Surendran

votes

1 answer

Parsing HTML elements in Apache Tika

How can I retrieve the value of this element with id 48227783 using Apache Tika?

Ownage!

I am trying to retrieve the value 'Ownage!'. I tried to use mapSafeElement, DefaultHtmlMapper objects seems…

java html-parsing apache-tika

asked Oct 14 '13 at 19:44

akunyer

votes

1 answer

How to write custom ContentHandler using Apache Tika?

I want to extract the text inside certain tags in HTML files using Apache Tika, so I am writing a custom ContentHandler that is supposed to extract information from these tags. My custom ContentHandler code looks like…

java html-parsing apache-tika

asked Oct 10 '13 at 13:32

Shekhar

11,438
36
130
186
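A minimal sketch of the kind of handler this question describes, under stated assumptions. Tika parsers report a document as XHTML SAX events to an org.xml.sax.ContentHandler, so a handler that keeps only the text inside chosen elements can be written against the plain SAX API. The class name TagTextHandler and the helper extract are hypothetical, and the JDK's own SAX parser stands in for Tika here so that the example runs without extra dependencies; with Tika, the same handler would be passed to the parser instead.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: collects the text found inside selected elements.
final class TagTextHandler extends DefaultHandler {
    private final Set<String> wantedTags;
    private final List<String> texts = new ArrayList<>();
    private final StringBuilder current = new StringBuilder();
    private int depth = 0; // greater than 0 while inside a wanted element

    TagTextHandler(Set<String> wantedTags) {
        this.wantedTags = wantedTags;
    }

    // Namespace-aware sources fill localName; the default JDK parser fills qName.
    private static String nameOf(String localName, String qName) {
        return (localName == null || localName.isEmpty()) ? qName : localName;
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        // Start counting at a wanted element and keep counting for its children.
        if (depth > 0 || wantedTags.contains(nameOf(localName, qName))) {
            depth++;
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        if (depth > 0) {
            current.append(ch, start, length);
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        // When the outermost wanted element closes, store its accumulated text.
        if (depth > 0 && --depth == 0) {
            texts.add(current.toString().trim());
            current.setLength(0);
        }
    }

    List<String> getTexts() {
        return texts;
    }

    // Stand-in driver: parses well-formed XHTML with the JDK SAX parser.
    static List<String> extract(String xhtml, Set<String> tags) {
        TagTextHandler handler = new TagTextHandler(tags);
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xhtml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse input", e);
        }
        return handler.getTexts();
    }

    public static void main(String[] args) {
        String xhtml = "<html><body><h1>Title</h1><p>Body <b>text</b></p>"
                + "<div>skip</div></body></html>";
        System.out.println(extract(xhtml, Set.of("h1", "p"))); // prints [Title, Body text]
    }
}
```

Text of nested elements (the b inside the p above) is folded into the enclosing wanted element. Whether that is the desired behaviour depends on the tags being targeted.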

votes

2 answers

Retrieving absolute URL from a webpage

I want to extract full links from an HTML file. By full links I mean absolute links. I used Tika for this purpose. Here is my code: URL url = new URL("http://www.domainname.com/"); InputStream input = url.openStream(); LinkContentHandler linkHandler = new…

java html apache-tika

asked Oct 05 '13 at 10:28

Alex

1,406
2
18
33
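One way to approach this, sketched under assumptions: once the href values have been collected (for example by the LinkContentHandler mentioned in the question), each one can be resolved against the page URL with java.net.URI from the JDK. The class and method names below are hypothetical, and the sample hrefs are made up for illustration.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: turns hrefs found on a page into absolute URLs.
final class AbsoluteLinks {

    // Resolves one href against the URL of the page it was found on.
    // Absolute hrefs come back unchanged; relative ones are resolved.
    static String toAbsolute(String baseUrl, String href) {
        return URI.create(baseUrl).resolve(href.trim()).toString();
    }

    // Resolves every href in the list, preserving order.
    static List<String> allToAbsolute(String baseUrl, List<String> hrefs) {
        List<String> result = new ArrayList<>();
        for (String href : hrefs) {
            result.add(toAbsolute(baseUrl, href));
        }
        return result;
    }

    public static void main(String[] args) {
        String base = "http://www.domainname.com/docs/index.html";
        List<String> hrefs = List.of("page.html", "/img/logo.png", "../about.html",
                "http://other.example/x");
        for (String link : allToAbsolute(base, hrefs)) {
            System.out.println(link);
        }
    }
}
```

URI.create throws IllegalArgumentException for hrefs that are not valid URIs (for instance ones containing unescaped spaces), so real pages may need those cleaned up or skipped first. The base should also include a path, at least a trailing slash, for relative hrefs to resolve as expected.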

votes

0 answers

Apache Tika does not render td tag for blank cell in Excel file

When I parse an Excel file that has an empty cell, it doesn't create an extra td tag. If there are three cells of which the middle one is empty, it skips the middle cell and only outputs the 1st and 3rd cells with a td tag. How can I tell Tika not to ignore the…

java apache-tika

asked Sep 30 '13 at 09:22

Krishna

votes

1 answer

Apache Tika text extraction on Google App Engine

I need to extract text from a few document types (.doc, .docx, .pdf and .txt primarily) from email attachments. The application is running on Google App Engine. Apache Tika does exactly what I need it to, but I'm running into a SecurityException when it…

java google-app-engine apache-tika

asked Sep 19 '13 at 16:28

user2357178

votes

0 answers

How to decode special characters with Apache Tika

I'm using Apache Tika to parse some MS Word documents to HTML (String). The problem is that some documents contain special characters (e.g. mathematical operators). Is there any way to solve this? Thank you for your help. Input: Output Source…

java parsing ms-word apache-poi apache-tika

asked Aug 27 '13 at 21:24

Peter Jurkovic

2,686
6
36
55
