When I index a .docx document with Apache Solr 4.9 (Solr Cell), the extracted text contains a lot of "\n" sequences. Is there some way to either clean the field content or remove the "\n"?
The field content looks like this:
"content": [
" \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n Solr es un motor de búsqueda de código abierto basado en la biblioteca Java del proyecto Lucene, con APIs en XML/HTTP y JSON , resaltado de resultados, búsqueda por facetas, caché, y una interfaz para su administración \n \n "
I'm using SolrJ, Java, Tomcat 8, and Apache Solr 4.9. I also tried modifying schema.xml, using a regex on the tokenizer to replace the "\n" with "" (an empty string), and tried another approach as well, but nothing worked.
Here is the code:
SolrServer solrServer = new HttpSolrServer(url, httpClient);
ContentStreamUpdateRequest up = new ContentStreamUpdateRequest("/update/extract");
up.addFile(new File("C:\\doc.docx"),"");
up.setParam("literal.id", "indexDoc.docx");
up.setParam("field", "anything");
up.setAction(AbstractUpdateRequest.ACTION.COMMIT, true, true);
NamedList<Object> result = solrServer.request(up);
// Query the document back to inspect the extracted content
QueryResponse rsp = solrServer.query(new SolrQuery("id:indexDoc.docx"));
System.out.println(rsp.toString());
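To make the goal concrete, here is a minimal sketch in plain Java of the cleanup I am after: collapse every run of whitespace (spaces and newlines) into a single space and trim the ends. The class name `WhitespaceCleaner` and the method `collapseWhitespace` are hypothetical names used only for illustration, they are not part of SolrJ or Solr Cell. As far as I understand, changes to the tokenizer only affect the indexed terms and not the stored value that comes back in the response, which may be why my schema.xml attempt had no visible effect, so ideally the same normalization would happen before the value is stored.

```java
public class WhitespaceCleaner {

    // Hypothetical helper: collapses each run of whitespace
    // (spaces, tabs, newlines) into one space and trims the ends.
    static String collapseWhitespace(String raw) {
        if (raw == null) {
            return "";
        }
        return raw.replaceAll("\\s+", " ").trim();
    }

    public static void main(String[] args) {
        // Shortened stand-in for the "content" value returned by Solr
        String extracted = " \n \n \n \n Solr es un motor \n \n ";
        System.out.println("[" + collapseWhitespace(extracted) + "]");
    }
}
```

This only cleans the string on the client after querying. It does not change what Solr has stored, so it is a workaround rather than the fix I am looking for.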