
Can we import data from Amazon S3 into MarkLogic using

  1. JavaScript/XQuery API
  2. MarkLogic Content Pump
  3. Any other way?

Please share the reference, if available.

blackzero

4 Answers


I'm not an AWS expert by any stretch, but if you know the locations of data on S3, you can use xdmp:document-get() with an http:// (or https://) prefix in the $location to retrieve documents. You can also use xdmp:http-get(), perhaps to query for the locations of your documents. Once that call has returned, you can insert the result with the usual xdmp:document-insert().
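A minimal XQuery sketch of that approach (the bucket URL and target URI are placeholders you'd replace with your own):

```xquery
xquery version "1.0-ml";

(: Fetch one object from S3 over HTTPS... :)
let $location := "https://s3.amazonaws.com/your-bucket/docs/example.xml"
let $doc := xdmp:document-get(
  $location,
  <options xmlns="xdmp:document-get">
    <format>xml</format>
  </options>)

(: ...and insert it into the database under a URI of your choosing. :)
return xdmp:document-insert("/imported/example.xml", $doc)
```

This runs in a single transaction, so it's only suitable for one document or a small handful at a time.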

That approach should be fine for a small number of documents. If you have a large set you want to import, you'll have to factor in the possibility of the transaction timing out.

For a larger data set, you might want to manage the process externally. Here are a couple options:

  • export data from S3 onto your local filesystem, then use MLCP to send it to MarkLogic
  • insert a document that has a list of resources at S3 that you want to import; spawn tasks that will each take a group of those resources and import them using xdmp:document-get()
  • use Java code to pull a document (or batch of documents) from S3, then use the Java Client API to insert that data into MarkLogic
  • once MarkLogic 9 comes out, use the Data Movement SDK, which is intended to make projects like this easier (as of this writing, the DMSDK is still in development)
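The second option above can be sketched roughly as follows. This is a hypothetical illustration: it assumes a manifest document at /s3-manifest.xml whose `url` elements hold the S3 object URLs, and derives each target URI from the last path segment.

```xquery
xquery version "1.0-ml";

(: Read the list of S3 resource URLs from the manifest document. :)
let $urls := fn:doc("/s3-manifest.xml")//url/fn:string(.)
let $batch-size := 100
for $i in 1 to xs:integer(fn:ceiling(fn:count($urls) div $batch-size))
let $batch := fn:subsequence($urls, ($i - 1) * $batch-size + 1, $batch-size)
return
  (: Spawn one task per batch so each runs in its own transaction,
     avoiding a single long-running transaction that could time out. :)
  xdmp:spawn-function(
    function() {
      for $url in $batch
      return xdmp:document-insert(
        "/imported/" || fn:tokenize($url, "/")[fn:last()],
        xdmp:document-get($url))
    },
    <options xmlns="xdmp:eval">
      <update>true</update>
    </options>)
```

Each spawned task lands on the task server queue, so a failure in one batch doesn't roll back the others.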
Dave Cassel
  • I'd highly recommend the Java Client API for this use case. – Sam Mefford Nov 21 '16 at 15:08
  • @dave-cassel Thank you for the response. Can _MLCP_ somehow work with data on S3? It runs a MapReduce job, so ideally it should. Importing terabytes of data onto the local disk may not be the optimal solution. Also, if the cluster is in AWS, the data transfer would be 2x (download, then bulk upload using MLCP). – blackzero Nov 22 '16 at 08:05
  • 1
    @blackzero MLCP only knows two input sources to work with: the file system and MarkLogic itself (for copy and export operations). For MarkLogic 8, I think your best bets are either options 2 or 3 above. – Dave Cassel Nov 22 '16 at 14:37

Load a test.xml file from an AWS S3 bucket into the database associated with your REST API instance using the /documents service:

curl https://s3.amazonaws.com/yourbucket/test.xml | curl -v --digest --user user:password -H "Content-Type: application/xml" -X PUT --data-binary @- "localhost:8052/v1/documents?uri=/docs/test.xml"
  • replace https://s3.amazonaws.com/yourbucket/test.xml with a valid AWS S3 object URL
  • replace user:password with valid credentials
  • replace localhost:8052 with the host and port of your MarkLogic REST app server
  • note that --data-binary (rather than -d) is used so the document body is sent byte-for-byte; -d would strip newlines
mg_kedzie

I recently faced the same issue and used the following MLCP command to copy data over; it worked.

mlcp export -host {host} -port {port} -username {username} -password {password} -output_file_path {S3 path} -collection_filter {collection name to be moved}
Amit Gope

If you configure your AWS credentials in the Admin UI, you can use a URL of the form "s3://bucket/key" to access S3 for read or write.
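With credentials configured, the HTTPS example earlier collapses to a one-liner; a minimal XQuery sketch (bucket and key names are placeholders):

```xquery
xquery version "1.0-ml";

(: Once S3 credentials are configured server-side, s3:// URLs
   can be passed straight to xdmp:document-get. :)
xdmp:document-insert(
  "/imported/test.xml",
  xdmp:document-get("s3://your-bucket/test.xml"))
```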

See the MarkLogic EC2 guide and this similar Stack Overflow question.

DALDEI