
I'd like a simple Solr setup where I can index and search large folders of PDF/DOCX files. I mostly need just full-text search; there is no need for separate fields, and the original documents do not seem to have a well-defined structure anyway. I followed https://lucene.apache.org/solr/quickstart.html, which is straightforward. However, when I try to index my own folder with some PDF files, some files return an error like:

POSTing file G1504225.pdf (application/pdf) to [base]/extract
SimplePostTool: WARNING: Solr returned an error #400 (Bad Request) for url: http://localhost:8983/solr/gettingstarted/update/extract?resource.name=%2Fhome%2Fsolr%2Fsolr-6.5.1%2F..%2Ftrain_data%2FG1504225.pdf&literal.id=%2Fhome%2Fsolr%2Fsolr-6.5.1%2F..%2Ftrain_data%2FG1504225.pdf
SimplePostTool: WARNING: Response: <?xml version="1.0" encoding="UTF-8"?>
<response>
<lst name="responseHeader"><int name="status">400</int><int name="QTime">263</int></lst>
<lst name="error">
<lst name="metadata">
<str name="error-class">org.apache.solr.common.SolrException</str>
<str name="root-error-class">java.lang.NumberFormatException</str>
<str name="error-class">org.apache.solr.update.processor.DistributedUpdateProcessor$DistributedUpdatesAsyncException</str>
<str name="root-error-class">org.apache.solr.update.processor.DistributedUpdateProcessor$DistributedUpdatesAsyncException</str>
</lst>
<str name="msg">Async exception during distributed update: Error from server at http://127.0.1.1:8983/solr/gettingstarted_shard2_replica1: Bad Request

request: http://127.0.1.1:8983/solr/gettingstarted_shard2_replica1/update?update.chain=add-unknown-fields-to-the-schema&amp;update.distrib=TOLEADER&amp;distrib.from=http%3A%2F%2F127.0.1.1%3A8983%2Fsolr%2Fgettingstarted_shard1_replica1%2F&amp;wt=javabin&amp;version=2
Remote error message: ERROR: [doc=/home/solr/solr-6.5.1/../train_data/G1504225.pdf] Error adding field 'title'='United Nations' msg=For input string: "United Nations"</str>
<int name="code">400</int>
</lst>
</response>
SimplePostTool: WARNING: IOException while reading response: java.io.IOException: Server returned HTTP response code: 400 for URL: http://localhost:8983/solr/gettingstarted/update/extract?resource.name=%2Fhome%2Fsolr%2Fsolr-6.5.1%2F..%2Ftrain_data%2FG1504225.pdf&literal.id=%2Fhome%2Fsolr%2Fsolr-6.5.1%2F..%2Ftrain_data%2FG1504225.pdf
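For reference, I index the folder with the post tool from the quickstart, something like this (paths as in my setup above):

    bin/post -c gettingstarted ../train_data/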

Most of the files are fine and I can search them. Any ideas?

kakk11

1 Answer


Solr uses Tika to extract the text from those files. Some types of files, PDFs especially, are hard to parse, as PDF is a proprietary format and Tika is always trying to catch up with edge cases. So it is normal that some files will throw errors; you have to expect that.

Search for NumberFormatException together with PDFBox and see how many instances are found (PDFBox is the library Tika uses for PDF files).

If you really want to get all the text from every PDF, even the ones that error out, you can put those files in a separate folder and process them again, extracting the text yourself with another library. Different libraries will give different results for the same PDF, so you can use the superset of the text that several libraries produce. You will have to write some glue code for this, unless Tika allows you to plug in specific libraries for specific file types (not sure if it does now; it didn't before).
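As a rough sketch of that fallback (assuming Poppler's pdftotext and pdfminer's pdf2txt.py are installed, and that the failing files were moved to a failed/ folder):

    # Run two independent extractors over the PDFs that errored out,
    # so the outputs can later be merged into a superset of the text.
    for f in failed/*.pdf; do
      pdftotext "$f" "${f%.pdf}.poppler.txt"        # Poppler's extractor
      pdf2txt.py -o "${f%.pdf}.pdfminer.txt" "$f"   # pdfminer's extractor
    done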

Persimmonium
  • Thanks. The PDF that fails is actually very simple, and all the relevant text is easily extracted with pdf2txt. Also, the error message does not complain about extracting data from the PDF, but rather about inserting it into Solr, though I may be missing something here, of course. Anyway, just converting all the PDFs to txt with pdf2txt and then indexing seems to work, and this is my quick hack for the moment. – kakk11 May 22 '17 at 11:39
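For anyone taking the same route, the workaround described in the comment above would look roughly like this (assuming pdfminer's pdf2txt.py and the quickstart's gettingstarted collection):

    # Convert every PDF to plain text, then index only the txt files.
    for f in ../train_data/*.pdf; do
      pdf2txt.py -o "${f%.pdf}.txt" "$f"
    done
    bin/post -c gettingstarted -filetypes txt ../train_data/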