MLCP Copy command with redaction getting timed out

Question

ML version used: 9.0-10.4

I am running the MLCP COPY command on a large data set (39753201 docs). The command fails with the error below.

2020-07-29 20:38:09 WARN  NativeCodeLoader:62 - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2020-07-29 20:38:09 INFO  ContentPump:227 - Job name: local_1071163736_1
2020-07-29 20:38:10 INFO  MarkLogicInputFormat:420 - Fetched 6 forest splits.
2020-07-29 20:38:10 INFO  MarkLogicInputFormat:551 - Made 39757 split(s).
2020-07-29 20:38:11 INFO  LocalJobRunner:519 -  completed 0%
2020-07-29 20:48:10 ERROR DatabaseContentReader:286 - QueryException:com.marklogic.xcc.exceptions.XQueryException: XDMP-EXTIME: for $doc in $documents -- Time limit exceeded
 [Session: user=admin, cb=#17742233824102065206 [ContentSource: user=admin, cb=cndb [provider: address=localhost/127.0.0.1:8000, pool=0/64]]]
 [Client: XCC/9.0-10, Server: XDBC/9.0-10.4]
in /MarkLogic/redaction.xqy, on line 78
expr: for $doc in $documents,
in rdt:redact((fn:doc("doc-1.xml"), fn:doc("doc-2.xml"), fn:doc("doc-3.xml"), ...), ("numeric-rules", "rule-2", "binary-rules", ...))
in /eval, on line 9
expr: for $doc in $documents

Split parameters used:

max_split_size = 1000
thread_count = 12

I'm not sure why I'm getting the timeout error. Running the redaction on 2000 docs in QConsole takes only 10-15 seconds.

I modified the error log above to hide sensitive info (such as doc-1.xml).

Are some of the docs super large? Testing against your 2,000 docs went fast, but maybe some docs in that giant corpus are orders of magnitude larger than your test ones? — hunterhacker, Jul 30 '20 at 07:14
No, the docs aren't super large. Does `max_split_size = 1000` mean that at most 1000 docs will be picked up per thread? I added logging to redaction.xqy (inside the redact function), and the total count of $documents comes out as 6625576, so it seems that many docs are being processed in a single thread. The error above occurs for all the threads. — Dixit Singla, Jul 30 '20 at 07:35
There was a bug fixed in the recent version of MarkLogic related to mlcp and redaction. Can you upgrade to mlcp 9.0-12 and try it? — James Kerr, Jul 30 '20 at 08:21
That will be a bit difficult for us at the moment. In the COPY command I was using a query filter; after removing the query filter, the $documents count comes out as 1000. I'm not sure why query_filter is causing the problem (the query filter param is `cts:not-query(cts:collection-query(('col-1', 'col-2')))`). — Dixit Singla, Jul 30 '20 at 08:25
I could be wrong, but it sounds like you are hitting this issue: https://github.com/marklogic/marklogic-contentpump/pull/127. It was fixed in the 9.0-12 release of mlcp. I believe you can just use that release of mlcp with your current version of the server to test it out. Do you have an active support contract with MarkLogic? If so, I suggest opening a ticket there. — James Kerr, Jul 30 '20 at 09:59
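
As a rough sanity check on the numbers quoted in this thread, the arithmetic can be sketched in Python. This is only a back-of-the-envelope sketch: it assumes documents are spread roughly evenly across the 6 forests and that splits are cut per forest with at most one partial split each. Neither assumption is confirmed by the thread, and the function names are made up for illustration.

```python
import math

# Figures quoted in the question, its log output, and the comments.
TOTAL_DOCS = 39_753_201      # size of the data set being copied
FOREST_COUNT = 6             # "Fetched 6 forest splits."
MAX_SPLIT_SIZE = 1000        # max_split_size setting
LOGGED_SPLITS = 39_757       # "Made 39757 split(s)."
OBSERVED_BATCH = 6_625_576   # count of $documents logged inside the redact function


def expected_split_range(total_docs, forest_count, max_split_size):
    """Bounds on the split count, ASSUMING splits are cut per forest and each
    forest adds at most one partial split (an assumption, not confirmed)."""
    lower = math.ceil(total_docs / max_split_size)
    return lower, lower + forest_count


def batch_to_forest_ratio(batch_size, total_docs, forest_count):
    """Observed batch size relative to an even per-forest share (ASSUMED even)."""
    return batch_size / (total_docs / forest_count)


low, high = expected_split_range(TOTAL_DOCS, FOREST_COUNT, MAX_SPLIT_SIZE)
ratio = batch_to_forest_ratio(OBSERVED_BATCH, TOTAL_DOCS, FOREST_COUNT)

print(f"expected splits: {low}..{high}, logged: {LOGGED_SPLITS}")
print(f"observed batch / per-forest share: {ratio:.4f}")
print(f"observed batch / max_split_size: {OBSERVED_BATCH / MAX_SPLIT_SIZE:.0f}x")
```

With these figures the logged split count falls inside the expected range, while the observed batch is almost exactly one sixth of the corpus, i.e. about one whole forest rather than 1000 docs. That is consistent with the split range being ignored when the query filter is present, but it is an inference from the numbers, not a confirmed diagnosis.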
