As I understand it, both MLCP content transformations and triggers can be used to modify ingested documents. The difference is that a content transformation operates on the in-memory document object during ingestion, whereas a trigger fires after a document has been created.
So it seems to me there is no reason why I cannot use both of them together. My use case is that I need to update some nodes of the documents after they are ingested into the database. The reason I use a trigger is that running the same logic in an MLCP transformation using the in-mem-update module always caused ingestion failures, presumably due to the large file size and the large number of nodes I attempted to update:
2018-08-22 23:02:24 ERROR TransformWriter:546 - Exception:Error parsing HTTP headers: A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond
So far, I have not been able to combine Content Transformations and Triggers. When I enabled transformation during MLCP ingestion, the trigger was not fired. When I disabled the transformation, the trigger worked without problem.
Is there any intrinsic reason why I cannot use both of them together? Or is it an issue related to my configuration? Thanks!
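For reference, registering a post-commit trigger on document creation with the triggers library might look roughly like the sketch below. The trigger name, the /pdf/ directory scope, the Modules database name and the module path are hypothetical placeholders, and the sketch assumes the code is evaluated against the triggers database associated with the content database.

```xquery
xquery version "1.0-ml";

import module namespace trgr = "http://marklogic.com/xdmp/triggers"
  at "/MarkLogic/triggers.xqy";

(: Assumption: this runs against the triggers database of the content database. :)
trgr:create-trigger(
  "strip-styling",                               (: hypothetical trigger name :)
  "Remove styling nodes from converted PDFs",
  trgr:trigger-data-event(
    trgr:directory-scope("/pdf/", "infinity"),   (: hypothetical URI prefix :)
    trgr:document-content("create"),             (: fire on document creation :)
    trgr:post-commit()                           (: run after the insert commits :)
  ),
  trgr:trigger-module(
    xdmp:database("Modules"),                    (: assumed modules database :)
    "/",
    "/triggers/strip-styling.xqy"                (: hypothetical module path :)
  ),
  fn:true(),                                     (: enabled :)
  xdmp:default-permissions()
)
```

If the URIs produced by the transformation fall outside the scope given here, the event would not match, so the scope is one of the settings worth comparing between the two runs.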
Edit:
I would like to provide some context for clarification and report results based on suggestions from @ElijahBernstein-Cooper, @MadsHansen and @grtjn (thanks!). I am using the MarkLogic Data Hub Framework to ingest PDF files (some are quite large) as binaries and extract the text as XML. I essentially followed this example, except that I am using xdmp:pdf-convert instead of xdmp:document-filter: https://github.com/marklogic/marklogic-data-hub/blob/master/examples/load-binaries/plugins/entities/Guides/input/LoadAsXml/content/content.xqy
While xdmp:pdf-convert seems to preserve the PDF structure better than xdmp:document-filter, it also includes some styling nodes (<link> and <style>) and attributes (class and style) which I do not need. In attempting to remove them I explored two different approaches:
- The first approach is to use the in-mem-update module to remove the unwanted nodes from the in-memory document representation within the above content.xqy script, as part of the content transformation flow. The problem is that the process can be quite slow, and as @grtjn pointed out I have to limit parallelization to avoid timeouts.
- The second approach is to use a post-commit trigger function to modify the documents using xdmp:node-delete after they have been ingested into the database. However, the trigger won't fire when the triggering condition is set to document-content("create"). It does fire if I change the condition to document-content("modify"), but for some reason I cannot access the document using fn:document($trgr:uri), similar to this SO question (MarkLogic 9 sjs trigger not able to acces post-commit() document data).