Questions tagged [solr-cell]

Solr Content Extraction Library: a Solr contrib module that converts the raw content of a rich document into something Solr can index.

Solr Cell's main component is the ExtractingRequestHandler, which uses Tika to let users upload binary files to Solr and have Solr extract their text and index it.
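As a rough illustration of how a client addresses the ExtractingRequestHandler, here is a minimal sketch that only builds the request URL. It assumes the handler is mounted at the conventional `/update/extract` path; the host, the core name `mycore`, and the helper name `build_extract_url` are hypothetical. The `literal.id` and `commit` parameters are the ones used in the curl examples from the Solr documentation.

```python
from urllib.parse import urlencode


def build_extract_url(core_url, doc_id, commit=True):
    """Build the extract URL for one document (hypothetical helper).

    core_url is the base URL of a Solr core; the /update/extract path
    is assumed to be where the ExtractingRequestHandler is registered.
    """
    # literal.<field> sets a field value on the resulting Solr document.
    params = {"literal.id": doc_id}
    if commit:
        params["commit"] = "true"
    return core_url.rstrip("/") + "/update/extract?" + urlencode(params)


if __name__ == "__main__":
    # Hypothetical local core; nothing is sent over the network here.
    print(build_extract_url("http://localhost:8983/solr/mycore", "doc1"))
```

The binary file itself would then be sent to that URL as a multipart upload, for example with `curl -F "myfile=@tutorial.html"`.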

71 questions
1
vote
3 answers

Solr ExtractingRequestHandler giving empty content field

I'm using Solr 6.2.1 and ExtractingRequestHandler (already included in Solr 6.2.1) to index pdf and word documents. All documents (pdf and word) are indexed with metadata (title, date, cp_revision, compagny, ...) but the content field is always…
Marine Msg
  • 23
  • 2
1
vote
1 answer

SOLR Tika: add text of file to existing record (ExtractingRequestHandler)

I am indexing posts in SOLR with "name", "title", and "description" fields. I'd like to later be able to add a file (like a Word doc or a PDF) using Tika / the ExtractingRequestHandler. I know I can add documents like so: (or through other…
Matt Hampel
  • 5,088
  • 12
  • 52
  • 78
1
vote
1 answer

Ways to send binary/structured documents to SOLR?

I am using SOLR's ExtractingRequestHandler to ingest the text of documents. The examples in the documentation all use curl to stream documents, like so: curl 'http://.../extract?literal.id=doc1&commit=true' -F "myfile=@tutorial.html" That works…
Matt Hampel
  • 5,088
  • 12
  • 52
  • 78
1
vote
0 answers

Solr: Perform stemming on a field and get the sorted list of stemmed words which were most frequent

Is there a way to use stemming on a field at index time and then, at query time, retrieve a list of stemmed words sorted by the frequency of their original occurrence? For example, assume my 'text' field has contents of a document and contains only…
shaffooo
  • 1,478
  • 23
  • 28
1
vote
0 answers

How to remove a lot of "\n" in text extracted from a Word file using Solr?

When I index a .docx document with Apache Solr 4.9 (Solr Cell), it extracts the text with a lot of "\n". Is there some way to either clean the field content or remove the "\n"? The field content looks like: "content": [ " \n \n \n \n \n \n …
kinopio
  • 21
  • 7
1
vote
5 answers

Can we search for .txt files in Solr search engine?

I am using the Solr search engine for document retrieval in my project. My dataset is in .txt file format, but Solr only gives options for JSON, XML, PDF, and some other file formats. There is no option for text files. Do I need some modifications…
Madhusudan
  • 435
  • 2
  • 9
  • 26
1
vote
1 answer

Error while indexing .xml files in solr

I am trying to index XML files in the Solr search engine using the following command: java -Durl=http://10.1.11.143:8080/solr/#/ -jar post.jar solr.xml But I am getting the following error: SimplePostTool version 1.5 Posting files to base url…
Madhusudan
  • 435
  • 2
  • 9
  • 26
1
vote
1 answer

Setting maximum string length in ExtractingRequestHandler ("Solr Cell") .. setMaxStringLength()

I'm using Solr and ExtractingRequestHandler to index documents but I do not know how to do the equivalent of Tika setMaxStringLength(). It appears to be indexing all of the smaller documents but not all of the text of a large document, which might…
mlevy
  • 87
  • 6
1
vote
1 answer

Solr: Excluding certain HTML tags or only including certain tags within indexes

I'm currently using Solr Cell to grab the contents of several HTML pages and index them. The issue is that I have a menu in the header which is shown on all the pages. This menu and all its items are appearing within the search results. I don't want…
mangesh
  • 100
  • 5
1
vote
0 answers

How to get date strings from the content of a PDF with Apache Solr

Hi all, I am new to Apache Solr. I have a PDF which contains date information like - bla bla bla 2012-11-23 11:11:12 bla bla ...- and I want to get all dates from the content. I read some documentation…
1
vote
1 answer

How to make an association when using Lucene/Solr to import records from a database and doc files at the same time

I store information about binary documents (file metadata) in a database and store the binary documents themselves in the filesystem, using the file name to associate each file with its information in the database. Now I want to import all of this data (file metadata and full-text content in binary…
EeE
  • 665
  • 5
  • 12
  • 27
1
vote
0 answers

#500 Internal Server Error when trying to add PDF to Solr index with extraction

I am a first-time Solr user, using v3.5 with Tomcat 7 on a Windows 7 system. I went through the XML example in example-docs with no problems. However, I'm going to need to use extraction with HTML and PDF files, and when I try to Post a PDF file…
user1263226
  • 250
  • 3
  • 12
0
votes
1 answer

Apache Solr - indexing PDF files

Hi, I have tried doing this with the binary distribution and have also compiled the source code myself. I tried running this with Apache Tomcat as well, but I always get the following error when I use a PDF file for indexing. I am using…
SarfarazSoomro
  • 413
  • 4
  • 8
0
votes
3 answers

NoClassDefFoundError MimeTypeException with PDF extraction

I am getting an exception trying to use update/extract with PDF files. My setup is: Ubuntu Server 11.10, Tomcat 6, Solr 3.5.0.2011.11.22.15.54.38. I can browse to solr/admin OK. I have put all the contrib/extract and apache-solr-cell3.5.0.jar libraries…
paulusm
  • 786
  • 6
  • 19
0
votes
1 answer

Solr ExtractingRequestHandler pdf text extraction

I have a problem with Solr's PDF text extraction. Solr uses Apache Tika to extract the text of a PDF file, and Tika uses PDFBox for that. When I send my PDF file to Solr it extracts the text successfully, but the text is totally messed…
itsme
  • 852
  • 1
  • 10
  • 23