
When I index a .docx document with Apache Solr 4.9 (Solr Cell), it extracts the text with a lot of "\n". Is there some way to either clean the field content or remove the "\n"?

The field content looks like:

"content": [
      " \n \n  \n  \n  \n  \n  \n  \n  \n  \n  \n  \n  \n  \n  \n  \n  \n  \n  \n  \n  \n  \n  \n  \n  \n  \n  \n  \n  \n  \n  \n  \n  \n  \n  \n  \n  \n  \n  \n  \n  \n  \n  \n  \n  \n \n   Solr  es un motor de búsqueda de código abierto basado en la biblioteca Java del proyecto Lucene, con APIs en XML/HTTP y  JSON , resaltado de resultados, búsqueda por facetas, caché, y una interfaz para su administración \n    \n  "

I'm using SolrJ, Java, Tomcat 8, and Apache Solr 4.9. I also tried modifying schema.xml, using a regex on the tokenizer to replace the "\n" with "" (an empty string), and tried another approach as well, but nothing worked.

Here is the code:

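  import java.io.File;
  import org.apache.solr.client.solrj.SolrQuery;
  import org.apache.solr.client.solrj.SolrServer;
  import org.apache.solr.client.solrj.impl.HttpSolrServer;
  import org.apache.solr.client.solrj.request.AbstractUpdateRequest;
  import org.apache.solr.client.solrj.request.ContentStreamUpdateRequest;
  import org.apache.solr.client.solrj.response.QueryResponse;
  import org.apache.solr.common.util.NamedList;

  // url and httpClient are defined elsewhere
  SolrServer solrServer = new HttpSolrServer(url, httpClient);
  // Send the file to the Solr Cell extract handler
  ContentStreamUpdateRequest up = new ContentStreamUpdateRequest("/update/extract");
  up.addFile(new File("C:\\doc.docx"), "");
  up.setParam("literal.id", "indexDoc.docx");
  up.setParam("field", "anything");
  up.setAction(AbstractUpdateRequest.ACTION.COMMIT, true, true);

  NamedList<Object> result = solrServer.request(up);

  // Read the document back to inspect the extracted content
  QueryResponse rsp = solrServer.query(new SolrQuery("id:indexDoc.docx"));
  System.out.println(rsp.toString());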
kinopio
  • How did you try to replace "\n" with a blank? – BernardoLima Aug 26 '14 at 02:26
  • @BernardoLima I tried including \n in the stopwords.txt file. I also tried with the PatternReplaceCharFilterFactory () but that didn't work either. (I speak Spanish) – kinopio Aug 26 '14 at 02:48
  • @BernardoLima, pattern="([/\n])" doesn't work :( – kinopio Aug 26 '14 at 02:59
  • I'm sorry, your pattern was fine. I'm going to try to think of another approach. – BernardoLima Aug 26 '14 at 03:04
  • Last shot, please try the following code: `` – BernardoLima Aug 26 '14 at 03:07
  • @BernardoLima I added that and now Solr broke. In the NetBeans output: Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/commons/io/IOUtils, and in the Solr admin UI: Estado HTTP 500 - {msg=SolrCore 'collection1' is not available due to init failure: Could not load core configuration for core collection1,trace=org.apache.solr.common.SolrException: SolrCore 'collection1' is not available due to init failure: Could not load core configuration for core collection1 – kinopio Aug 26 '14 at 03:12
  • Sorry, I don't know much about Solr, but I suspect that the code from your first comment that isn't working may be replacing only the first occurrence. That's why I suggested adding the command to replace all. – BernardoLima Aug 26 '14 at 03:17
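
One thing that may explain why the schema.xml attempts seemed to do nothing: char filters, tokenizers and stop words only shape the indexed terms, while the stored value that comes back in a query response is kept as it was received. So the "\n" would likely still appear in "content" even when the pattern is correct. Changing the stored value on the server is usually done in an update processor chain (Solr ships a RegexReplaceProcessorFactory that may be worth checking), but a simpler option is to normalize the whitespace on the client. Below is a minimal sketch in plain Java; the class name ExtractedTextCleaner and the method clean are hypothetical, and it assumes you already have the extracted text as a String (for example, read from the "content" field of the query response).

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of SolrJ: collapses whitespace in extracted text.
class ExtractedTextCleaner {

    // One or more whitespace characters (space, \t, \n, \r, \f, vertical tab).
    // Note: Java's default \s does not match the non-breaking space U+00A0.
    private static final Pattern WHITESPACE_RUN = Pattern.compile("\\s+");

    // Replaces every run of whitespace with a single space and trims both ends.
    static String clean(String raw) {
        if (raw == null) {
            return "";
        }
        return WHITESPACE_RUN.matcher(raw).replaceAll(" ").trim();
    }

    public static void main(String[] args) {
        // Shortened sample in the shape of the "content" field shown in the question.
        String raw = " \n \n  \n   Solr  es un motor de búsqueda \n    \n  ";
        System.out.println(clean(raw));
    }
}
```

This replaces all occurrences, not just the first, because the pattern matches whole runs of whitespace and replaceAll is applied across the entire string. It only changes what your application displays; the value stored in Solr stays untouched.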

0 Answers