
Can we import data from Amazon S3 into MarkLogic using

  1. JavaScript/XQuery API
  2. MarkLogic Content Pump
  3. Any other way?

Please share the reference, if available.

blackzero

4 Answers


I'm not an AWS expert by any stretch, but if you know the locations of data on S3, you can use xdmp:document-get() with an http:// (or https://) prefix in the $location to retrieve documents. You can also use xdmp:http-get(), perhaps to query for the locations of your documents. Once that call has returned, you can insert the result with the usual xdmp:document-insert().
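A minimal XQuery sketch of that approach (the bucket URL and target URI are placeholders you'd replace with your own):

```xquery
xquery version "1.0-ml";

(: Fetch one object from S3 over HTTPS... :)
let $location := "https://s3.amazonaws.com/your-bucket/docs/example.xml"
let $doc := xdmp:document-get(
  $location,
  <options xmlns="xdmp:document-get">
    <format>xml</format>
  </options>)

(: ...and insert it into the database under a URI of your choosing. :)
return xdmp:document-insert("/imported/example.xml", $doc)
```

This runs in a single transaction, so it's only suitable for one document or a small handful at a time.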

That approach should be fine for a small number of documents. If you have a large set you want to import, you'll have to factor in the possibility of the transaction timing out.

For a larger data set, you might want to manage the process externally. Here are a couple options:

  • export data from S3 onto your local filesystem, then use MLCP to send it to MarkLogic
  • insert a document that has a list of resources at S3 that you want to import; spawn tasks that will each take a group of those resources and import them using xdmp:document-get()
  • use Java code to pull a document (or batch of documents) from S3, then use the Java Client API to insert that data into MarkLogic
  • once MarkLogic 9 comes out, use the Data Movement SDK, which is intended to make projects like this easier (as of this writing, the DMSDK is still in development)
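The second option above can be sketched roughly as follows. This is a hypothetical illustration: it assumes a manifest document at /s3-manifest.xml whose `url` elements hold the S3 object URLs, and derives each target URI from the last path segment.

```xquery
xquery version "1.0-ml";

(: Read the list of S3 resource URLs from the manifest document. :)
let $urls := fn:doc("/s3-manifest.xml")//url/fn:string(.)
let $batch-size := 100
for $i in 1 to xs:integer(fn:ceiling(fn:count($urls) div $batch-size))
let $batch := fn:subsequence($urls, ($i - 1) * $batch-size + 1, $batch-size)
return
  (: Spawn one task per batch so each runs in its own transaction,
     avoiding a single long-running transaction that could time out. :)
  xdmp:spawn-function(
    function() {
      for $url in $batch
      return xdmp:document-insert(
        "/imported/" || fn:tokenize($url, "/")[fn:last()],
        xdmp:document-get($url))
    },
    <options xmlns="xdmp:eval">
      <update>true</update>
    </options>)
```

Each spawned task lands on the task server queue, so a failure in one batch doesn't roll back the others.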
Dave Cassel
  • I'd highly recommend the Java Client API for this use case. – Sam Mefford Nov 21 '16 at 15:08
  • @dave-cassel Thank you for the response. Can _MLCP_ somehow work with data on S3? It runs a MapReduce job, so ideally it should. Importing terabytes of data onto the local disk may not be the optimal solution. Also, if the cluster is in AWS, the data transfer would be 2x (download, then bulk upload using MLCP). – blackzero Nov 22 '16 at 08:05
  • 1
    @blackzero MLCP only knows two input sources to work with: the file system and MarkLogic itself (for copy and export operations). For MarkLogic 8, I think your best bets are either options 2 or 3 above. – Dave Cassel Nov 22 '16 at 14:37

Load a test.xml file from an AWS S3 bucket into the database associated with your REST API instance using the /documents service:

curl https://s3.amazonaws.com/yourbucket/test.xml | curl -v --digest --user user:password -H "Content-Type: application/xml" -X PUT --data-binary @- "localhost:8052/v1/documents?uri=/docs/test.xml"
  • replace https://s3.amazonaws.com/yourbucket/test.xml with a valid AWS S3 object URL
  • replace user:password with valid credentials
  • replace localhost:8052 with the host and port of your MarkLogic REST app server
  • note that --data-binary (rather than -d) is used so the document body is sent byte-for-byte; -d would strip newlines
mg_kedzie

I recently faced the same issue and used the following MLCP command to copy data over; it worked.

mlcp export -host {host} -port {port} -username {username} -password {password} -output_file_path {S3 path} -collection_filter {collection name to be moved}
Amit Gope

If you configure your AWS credentials in the Admin UI, you can use a URL of the form "s3://bucket/key" to access S3 for read or write.
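With credentials configured, the HTTPS example earlier collapses to a one-liner; a minimal XQuery sketch (bucket and key names are placeholders):

```xquery
xquery version "1.0-ml";

(: Once S3 credentials are configured server-side, s3:// URLs
   can be passed straight to xdmp:document-get. :)
xdmp:document-insert(
  "/imported/test.xml",
  xdmp:document-get("s3://your-bucket/test.xml"))
```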

See the MarkLogic EC2 guide and this similar Stack Overflow question.

DALDEI