
As I understand it, both an MLCP transformation and a trigger can be used to modify ingested documents. The difference is that a content transformation operates on the in-memory document object during ingestion, whereas a trigger fires after the document has been created in the database.
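
For reference, this is roughly the shape of the transform module MLCP expects (a sketch only; the namespace, module name and function body are illustrative):

```
xquery version "1.0-ml";

(: Minimal sketch of an MLCP content transform. MLCP passes each document
   as a map with "uri" and "value" keys; returning the map unchanged is a
   no-op transform. Namespace and names here are illustrative. :)
module namespace example = "http://example.com/mlcp-transform";

declare function example:transform(
  $content as map:map,
  $context as map:map
) as map:map*
{
  (: inspect or replace map:get($content, "value") here :)
  $content
};
```

The module is installed in the modules database and referenced on the MLCP command line with -transform_module, -transform_namespace and, if needed, -transform_function.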

So it seems to me there is no reason why I cannot use both of them together. My use case is that I need to update some nodes of the documents after they are ingested into the database. The reason I use a trigger is that running the same logic in the MLCP transformation with the in-mem-update module always caused the ingestion to fail, presumably because of the large file sizes and the large number of nodes I attempted to update:

2018-08-22 23:02:24 ERROR TransformWriter:546 - Exception:Error parsing HTTP headers: A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond

So far, I have not been able to combine content transformations and triggers. When I enabled the transformation during MLCP ingestion, the trigger did not fire. When I disabled the transformation, the trigger worked without a problem.

Is there any intrinsic reason why I cannot use both of them together? Or is it an issue related to my configuration? Thanks!

Edit:

I would like to provide some context for clarification and report results based on suggestions from @ElijahBernstein-Cooper, @MadsHansen and @grtjn (thanks!). I am using the MarkLogic Data Hub Framework to ingest PDF files (some are quite large) as binaries and extract the text as XML. I essentially followed this example, except that I am using xdmp:pdf-convert instead of xdmp:document-filter: https://github.com/marklogic/marklogic-data-hub/blob/master/examples/load-binaries/plugins/entities/Guides/input/LoadAsXml/content/content.xqy

While xdmp:pdf-convert seems to preserve the PDF structure better than xdmp:document-filter, it also includes some styling nodes (<link> and <style>) and attributes (class and style) that I do not need. In attempting to remove them, I explored two different approaches:

  1. The first approach is to use the in-mem-update module to remove the unwanted nodes from the in-memory document representation within the above content.xqy script, as part of the content transformation flow. The problem is that this process can be quite slow, and as @grtjn pointed out, I have to limit parallelization to avoid timeouts.
  2. The second approach is to use a post-commit trigger function to modify the documents with xdmp:node-delete after they have been ingested into the database. However, the trigger won't fire when the triggering condition is set to document-content("create"). It does fire if I change the condition to document-content("modify"), but for some reason I cannot access the document using fn:document($trgr:uri), similar to this SO question (MarkLogic 9 sjs trigger not able to access post-commit() document data). A simplified sketch of this approach is shown below.
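
For concreteness, the second approach boils down to a post-commit trigger module along these lines (a simplified sketch of what I am attempting):

```
xquery version "1.0-ml";

(: Simplified sketch of the post-commit trigger module for approach 2.
   The trigger framework binds $trgr:uri to the URI of the triggering document. :)
import module namespace trgr = "http://marklogic.com/xdmp/triggers"
  at "/MarkLogic/triggers.xqy";

declare variable $trgr:uri as xs:string external;
declare variable $trgr:trigger as node() external;

let $doc := fn:document($trgr:uri)  (: this is where I fail to get hold of the document :)
for $n in (
  $doc//*:link,
  $doc//*:style,
  $doc//@class[fn:not(parent::*:link or parent::*:style)],
  $doc//@style[fn:not(parent::*:link or parent::*:style)]
)
return xdmp:node-delete($n)
```
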
Fan Li
  • A quick test you could try is to use a transform in MLCP that does not change the node. This would let you know if the problem is associated with how you are transforming the documents. – Elijah Bernstein-Cooper Aug 23 '18 at 12:47
  • Are there any other relevant messages in the ErrorLog? Is it possible that your transaction is timing out? You could be under the threshold when performing half the work, but when executing the transform and the trigger, it takes too long for the configured limits. – Mads Hansen Aug 23 '18 at 13:20
  • @ElijahBernstein-Cooper Thanks for the suggestion! I tested with a minimal transformation module that just returns the input content, and the trigger was fired successfully. So I hope it narrowed down the scope. My actual transformation is part of the data-hub framework, in which I invoke xdmp:pdf-convert to process the input PDF files. Does it have a conflict with trigger? – Fan Li Aug 23 '18 at 13:48
  • @MadsHansen Thanks! I got the error message when I attempted to update the document nodes using in-mem-update in the transformation module (trigger was not used in the case). I found the in-mem-update to be much slower than the corresponding xdmp:node-* functions, which is why I opted to use trigger to modify the documents after the ingestion. – Fan Li Aug 23 '18 at 13:55
  • @FanLi since the trigger fired when you returned the raw node in the MLCP transform, I suspect that your original transform is returning a node / an event that doesn't match the trigger criteria. I don't see any reason why xdmp:pdf-convert would conflict with a trigger. I'd suggest reviewing the trigger event config: https://docs.marklogic.com/trgr:trigger-data-event – Elijah Bernstein-Cooper Aug 23 '18 at 15:03
  • The error message sounds mostly like a client-side time out to me. Have you tried trimming down on threads and transaction size? Try `-transaction_size 1 -batch_size 1 -thread_count 1` – grtjn Aug 23 '18 at 15:34
  • @ElijahBernstein-Cooper You are correct. After I changed the triggering criteria from `document-content("create")` to `document-content("modify")`, the trigger got fired during ingestion. However I don't understand why it is the case. Now my problem is that I am not able to get hold of the document by using `fn:document($trgr:uri)`, which may be related to this question: https://stackoverflow.com/q/47856917/3546482 – Fan Li Aug 24 '18 at 03:42
  • @grtjn Thanks. Your configuration worked with the in-mem-update module, although the ingestion became very slow. Is there any way to address the timeout without sacrificing too much speed? – Fan Li Aug 24 '18 at 04:46
  • You can scale up thread_count, but do so in small steps. Check system resources to monitor for overloading cpu or mem. If you try to process too much in parallel on one host, you'll just choke the system, and timeouts will re-appear. You could also scale out your cluster. You could also do the pdf-convert in post-commit triggers, or in spawned processes. CPF could be useful for that. Just be careful not to flood the task queue.. – grtjn Aug 24 '18 at 06:53

1 Answer


MLCP transforms and triggers operate independently. There is nothing in such transforms that should stop triggers from working per se.

Triggers are triggered by events. I typically use both a create and a modify trigger to cover the case where I import the same files a second time (for testing purposes, for instance).

Triggers also have a scope: they are configured to watch either a directory or a collection. Make sure your MLCP configuration matches the trigger scope, and that your transform does not change the URI in such a way that it no longer matches the directory scope, if that is what you use.
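
For illustration, creating a matching pair of create and modify triggers with a collection scope could look something like the following (run against your triggers database; the collection, module path and names are just placeholders):

```
xquery version "1.0-ml";

(: Sketch: create both a "create" and a "modify" post-commit trigger with a
   collection scope. All names, collections and module paths are placeholders. :)
import module namespace trgr = "http://marklogic.com/xdmp/triggers"
  at "/MarkLogic/triggers.xqy";

for $event in ("create", "modify")
return
  trgr:create-trigger(
    "cleanup-on-" || $event,
    "Strip styling nodes after ingest (" || $event || ")",
    trgr:trigger-data-event(
      trgr:collection-scope("pdf-xml"),   (: should match MLCP -output_collections :)
      trgr:document-content($event),
      trgr:post-commit()
    ),
    trgr:trigger-module(
      xdmp:database("Modules"),
      "/triggers/",
      "cleanup.xqy"
    ),
    fn:true(),
    xdmp:default-permissions()
  )
```
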

Looking more closely at the error message, however, I'd say it is caused by a timeout. Timeouts can occur both server-side (10 minutes by default) and client-side (depending on client-side settings, the limit could be much smaller). The message basically says that the server took too long to respond, so I'd say you are facing a client-side timeout.

Timeouts can be caused by limits that are set too low. You could try to increase the timeout settings both server-side (xdmp:set-request-time-limit()) and client-side (not sure how to do that in Java).
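
Server-side, for instance, you could raise the limit for the current request at the top of your transform or trigger module (it cannot exceed the app server's configured max time limit):

```
(: Allow this request up to 30 minutes instead of the default 10. :)
xdmp:set-request-time-limit(1800)
```
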

It is more common, though, that you are simply trying to do too much at the same time. MLCP opens transactions and tries to execute a number of batches within each transaction, a.k.a. the transaction_size. Each batch contains batch_size documents. By default MLCP tries to process 10 x 100 = 1000 documents per transaction.

It also runs with 10 threads by default, so it typically opens 10 transactions at the same time and tries to process 1000 docs in each of them in parallel. With simple inserts this is just fine. With heavier processing in transforms or pre-commit triggers, this can become a bottleneck, particularly when the threads start to compete for server resources like memory and CPU.

Functions like xdmp:pdf-convert can often be fairly slow. For starters, it depends on an external plugin, but also imagine it has to process a 200-page PDF. Binaries can be large. You'll want to pace down when processing them. If using -transaction_size 1 -batch_size 1 -thread_count 1 makes your transforms work, you really were facing timeouts, and may have been flooding your server. From there you can look at increasing some of the numbers, but binary sizes can be unpredictable, so be conservative.

It might also be worth looking at doing heavy processing asynchronously, for instance using CPF, the Content Processing Framework. It is a very robust implementation for processing content, and is designed to survive server restarts.
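
Short of a full CPF pipeline, the same idea can be sketched with a spawned task (also mentioned in the comments): keep the ingest itself minimal and push the expensive work onto the task server. A rough sketch, with a placeholder URI:

```
xquery version "1.0-ml";

(: Sketch: push the expensive clean-up (or the xdmp:pdf-convert call itself)
   onto the task server. In practice the URI would come from the transform
   or trigger that spawns the task; here it is a placeholder. :)
declare variable $uri as xs:string := "/converted/sample.xml";

xdmp:spawn-function(
  function() {
    let $doc := fn:doc($uri)
    for $n in ($doc//*:link, $doc//*:style)
    return xdmp:node-delete($n)
  },
  <options xmlns="xdmp:eval">
    <update>true</update>
  </options>
)
```

Just be careful not to flood the task queue either way.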

HTH!

grtjn
  • Thanks, @grtjn. I will tune down the parallelization based on your advice. I will look into CPF for a more robust solution in the future. – Fan Li Aug 24 '18 at 14:50