
I'm using the Data Import Handler (DIH) to create documents in Solr. Each document will have zero or more attachments (e.g., PDFs, Word docs). Each attachment's content is parsed via Tika and stored along with a path to the attachment. Neither the content nor the path is stored in the database, and I'd prefer to keep it that way.

I currently have a schema with all the fields needed by DIH. I then added attachmentContent and attachmentPath fields, both multiValued. However, when I use SolrJ to add the documents, only one attachment (the last one added) is stored and indexed by Solr. Here's the code:

        ContentStreamUpdateRequest up = new ContentStreamUpdateRequest("/update/extract");
        up.setParam("literal.id", id);

        for (MultipartFile file : files) {
            // skip over files where the client didn't provide a filename
            if (file.getOriginalFilename().equals("")) {
                continue;
            }
            File destFile = new File(destPath, file.getOriginalFilename());
            try {
                file.transferTo(destFile);

                up.setParam("literal.attachmentPath", documentWebPath + acquisition.getId() + "/" + file.getOriginalFilename());
                up.addFile(destFile);   
            } catch (IOException ioe) {
                ioe.printStackTrace();   
            }               
        }
        try {
            up.setAction(AbstractUpdateRequest.ACTION.COMMIT, true, true);            
            solrServer.request(up);
        } catch (SolrServerException sse) {
            sse.printStackTrace();
        } catch (IOException ioe) {
            ioe.printStackTrace();   
        }
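
To show what I think is happening, here's a minimal plain-Java sketch (a HashMap standing in for the request's parameter map; the file paths are made up). Each setParam-style put replaces the previous value for the same key, which would explain why only the last attachmentPath survives:

```java
import java.util.HashMap;
import java.util.Map;

public class ParamOverwriteDemo {
    public static void main(String[] args) {
        // Stand-in for the request's single-valued parameter map.
        Map<String, String> params = new HashMap<>();

        // The loop above effectively does this on each iteration:
        params.put("literal.attachmentPath", "/docs/1/first.pdf");
        params.put("literal.attachmentPath", "/docs/1/second.pdf");

        // Only the last value remains, matching the observed behavior.
        System.out.println(params.get("literal.attachmentPath")); // prints /docs/1/second.pdf
    }
}
```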

How can I get multiple attachments (content and paths) to be stored by solr? Or is there a better way to accomplish this?

James

1 Answer


Solr has a limitation here: only one document can be indexed per request through the extract API.
If you want multiple files indexed together, you can bundle them into a zip file (and apply the SOLR-2416 patch) and have the zip indexed.
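
To illustrate the bundling step, here's a sketch using only the JDK's java.util.zip (sending the resulting zip with a single addFile call to /update/extract is an assumption based on the patch's description, not something I've verified against it):

```java
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipOutputStream;

public class AttachmentZipper {
    /**
     * Bundles the given files into a single zip, which could then be
     * sent to /update/extract with one addFile call (on a patched Solr).
     */
    public static File zipAttachments(List<File> files, File destZip) throws IOException {
        try (ZipOutputStream zos = new ZipOutputStream(new FileOutputStream(destZip))) {
            byte[] buf = new byte[8192];
            for (File f : files) {
                zos.putNextEntry(new ZipEntry(f.getName()));
                try (FileInputStream in = new FileInputStream(f)) {
                    int n;
                    while ((n = in.read(buf)) > 0) {
                        zos.write(buf, 0, n);
                    }
                }
                zos.closeEntry();
            }
        }
        return destZip;
    }
}
```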

Jayendra
  • Thanks Jayendra. I'm using solr 3.6.1. Does this patch apply to that version? Also, I'm not sure how to apply the patch. When I click on SOLR-2416_ExtractingDocumentLoader.patch, I see text that indicates a coding change. Do I need to download the solr source, make the coding change and rebuild? – James Oct 26 '12 at 15:41
  • Yup it would apply to 3.6.1 as well and its a code change. You would need to apply the patch on source code and rebuild – Jayendra Oct 26 '12 at 17:54
  • Thanks for that information. Unfortunately, I don't think this will work with DIH. I can store the attachments via one zip under one Solr document record as you suggested. The problem is that I'm using DIH to populate the remaining fields in the document record. But when DIH runs (even with delta-import and clean=false), it removes the indexing of the zip file (i.e. DIH seems to delete the document record and then recreate it with the database data). The zip file is a big step toward solving this, but DIH is the other part. Any other thoughts on how to accomplish this? – James Oct 29 '12 at 15:19
  • The delta import would work the same as full import and reindex the document with db and zip file. The Tika Processor has a different patch to handle the zip file. But this should work fine as well. – Jayendra Oct 29 '12 at 15:30
  • Thanks for the quick response. I apologize; I should have been clearer. When I do a delta import, I do not have the zip file. The DIH runs on a schedule to get data only from the database, and I do not store the zip file contents in the database. I have code (above) that handles the initial processing of the file (now a zip file) and also creates the database entry. But when DIH runs on a schedule, it will not have the file contents. I was hoping that DIH would not overwrite my text field for the file contents but would only overwrite the fields indicated in the DIH query. – James Oct 29 '12 at 15:48
  • If you are using Solr 4.0 you can partially update your document without overwriting the text field. However, I'm not sure if it works with DIH. – Jayendra Oct 30 '12 at 03:36
  • I'm using 3.6.1, but I could switch to Solr 4.0. I might give that a shot. I figured out that I can run delta-import through code and specify a unique key to delta-import a specific document, so I no longer need to schedule the run. When my app goes to insert or update the database, it'll also have the attachments. So the app can call data-import first, followed by the update extract request. That'll work great. The only potential problem is searches within the zip. If I find text within the zip (via your patch), will I be able to tell which file(s) in the zip matched the search text? – James Oct 30 '12 at 16:38
  • Another alternative that I'm considering is multiple doc type within the single index (ala http://searchhub.org/dev/2011/02/12/solr-powered-isfdb-part-4/). I'll have one doc type referring to DIH data and another referring to attachment (binary file) data. The attachment data would include a "foreign key" field to reference back to the DIH data. This seems like a less complicated approach than above and I don't have to worry about also having to zip and have the zip around to delta-import DIH. Do you (or anyone else) have thoughts on this approach versus the one above? – James Oct 30 '12 at 16:43
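
For what it's worth, the multi-doc-type idea from the last comment can be sketched in plain Java (Maps standing in for Solr documents; the field names doc_type and parent_id are made up for illustration, and real code would build SolrJ SolrInputDocuments instead):

```java
import java.util.HashMap;
import java.util.Map;

public class MultiDocTypeSketch {
    public static void main(String[] args) {
        // Parent record, populated by DIH from the database.
        Map<String, Object> dihDoc = new HashMap<>();
        dihDoc.put("id", "acq-42");
        dihDoc.put("doc_type", "metadata");

        // One document per attachment, linked back via a "foreign key" field,
        // so DIH re-imports of the parent never clobber the attachment text.
        Map<String, Object> attachmentDoc = new HashMap<>();
        attachmentDoc.put("id", "acq-42-att-1");
        attachmentDoc.put("doc_type", "attachment");
        attachmentDoc.put("parent_id", "acq-42");
        attachmentDoc.put("attachmentPath", "/docs/42/report.pdf");

        // A search hit on an attachment doc can be joined back to its parent:
        System.out.println(attachmentDoc.get("parent_id").equals(dihDoc.get("id"))); // prints true
    }
}
```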