Questions tagged [solr-cell]

Solr Content Extraction Library (Solr Cell): a Solr contrib module responsible for converting the raw content of a rich document into something usable by Solr.

Solr Cell's main component is the ExtractingRequestHandler, which uses Tika to let users upload binary files to Solr and have Solr extract the text from them and index it.
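As a rough illustration of the upload flow described above, here is a minimal Python sketch that builds a request URL for the ExtractingRequestHandler and posts a file to it. The host, core name (`mycore`), document id, and file path are assumptions made for the example, and the handler is assumed to be registered at `/update/extract`; `literal.id`, `extractOnly`, and `commit` are standard request parameters, but check your own solrconfig.xml before relying on this.

```python
from urllib.parse import urlencode
from urllib.request import Request, urlopen


def build_extract_url(base_url, doc_id=None, extract_only=False, commit=True):
    """Build a URL for the /update/extract handler of a Solr core.

    base_url is the core URL, e.g. http://localhost:8983/solr/mycore
    (a hypothetical host and core name used only for illustration).
    """
    params = {}
    if doc_id is not None:
        # literal.<field> sets a field value directly on the indexed document.
        params["literal.id"] = doc_id
    if extract_only:
        # extractOnly=true returns the extracted text without indexing it.
        params["extractOnly"] = "true"
    elif commit:
        params["commit"] = "true"
    return f"{base_url.rstrip('/')}/update/extract?{urlencode(params)}"


def post_file(url, path, content_type="application/pdf"):
    """Send the raw bytes of a file to Solr and return the response body."""
    with open(path, "rb") as fh:
        body = fh.read()
    request = Request(
        url,
        data=body,
        headers={"Content-Type": content_type},
        method="POST",
    )
    with urlopen(request) as response:
        return response.read().decode("utf-8", errors="replace")


if __name__ == "__main__":
    # Hypothetical values; running this requires a live Solr instance.
    url = build_extract_url("http://localhost:8983/solr/mycore", doc_id="doc1")
    print(post_file(url, "example.pdf"))
```

Calling `build_extract_url` with `extract_only=True` is a convenient way to check what Tika extracts from a file before anything is written to the index.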

71 questions
3
votes
1 answer

Solr ExtractingRequestHandler giving empty content for pdf documents

I am using ExtractingRequestHandler in Solr to get document content and index it. It works fine for all Microsoft documents, but for PDFs the extracted content is empty. I have also tried extractOnly=true with curl, and that also…
aseem
3
votes
4 answers

Get page numbers of search results of a PDF in Solr

I'm building a web application where users can search for PDF documents and view them with pdf.js. I would like to display the search results with a short snippet of the paragraph where the search term was found and a link to open the document at…
Gesh
3
votes
0 answers

Getting date metadata using SolrCell

I'm using Solr 3.6 to index many different types of documents. I have several fields that define common information for all the documents, one of them being 'date' (ideally last modified date, just something to indicate how recent a document…
The Doge Prince
2
votes
1 answer

Solr open document after searching a keyword

I am trying to index some PDF documents and then create a search UI. This question is somewhat related to "Solr Index PDF documents and post them to a remote server". 1) Indexing PDF docs -> I use the Tika jar to convert PDFs to text files and then use…
Balaji.N.S
2
votes
1 answer

How to index pdf's content with SolrJ?

I'm trying to index a few PDF documents using SolrJ as described at http://wiki.apache.org/solr/ContentStreamUpdateRequestExample. The code is below: import static org.apache.solr.handler.extraction.ExtractingParams.LITERALS_PREFIX; import…
alessmar
2
votes
1 answer

Indexing pdf documents

What is the best way to index PDF documents? Should I index them by converting them to txt, or is there a better way to index PDF files?
Ahsan Iqbal
2
votes
2 answers

Using Zend Lucene to search Office 2003 or older files

I know there are already objects supporting Office 2007 files, but is there any native support for Office 2003 or earlier?
Amadeus45
1
vote
1 answer

No results when searching indexed PDF with Solr Cell

I've been working with Solr for a while. I recently tried the solr-cell component and I'm indexing some PDFs, but I'm having the exact same problem presented in this thread. When I search for *:* in the admin console, the PDFs are listed. However…
jag
1
vote
0 answers

Solr Get Paragraphs of Documents

I've been working with Solr for a couple of days, and I need to split a document into its paragraphs and then search each one of them. I tried a lot of things, but Solr just doesn't want to capture paragraphs correctly; either it captures…
1
vote
2 answers

Tika Solr Metadata mapping ignore document title

I have the following config file for solr: