1

I am trying to calculate the size of an XML document in MarkLogic as part of performance testing. Are there any built-in functions or queries that would give me the accurate size of the document? At the moment I use a formula like:

{string-length(string(data($doc))) idiv 2}
Mads Hansen
Aswanikumar

2 Answers

7

If by 'size' you mean how big the XML document would be if it were serialized as text ('to disk'),

 string-length(xdmp:quote( doc('file.xml') )) 

will give you the number of characters using the default encoding and serialization options.
The ratio of characters to bytes will vary from 1:1 to 1:3 in UTF-8 (1:4 for characters outside the Basic Multilingual Plane), depending on the distribution of Unicode characters. The count also depends on the difference between the serialization options specified to xdmp:quote() and the analogous formatting before ingestion (or after exporting). For Latin-script languages and default settings it is usually close to 1:1. To be more accurate you need to specify the exact serialization and encoding options and either save the document to the file system or convert it to binary and take the binary length. Even then it will be file system and OS dependent (block size, text encoding, etc.).
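
To see how much the serialization options alone move the number, the character count can be taken twice, once with the defaults and once with explicit options. This is only a sketch: the URI "/file.xml" is a placeholder, and the choice of the indent and indent-untyped options is an assumption for illustration; check the xdmp:quote documentation for the options your version accepts.

```xquery
xquery version "1.0-ml";

(: "/file.xml" is a placeholder URI, not taken from the question. :)
let $doc := fn:doc("/file.xml")

(: Character count with the default serialization options. :)
let $default := xdmp:quote($doc)

(: Character count with indentation switched on (illustrative options). :)
let $indented := xdmp:quote($doc,
  <options xmlns="xdmp:quote">
    <indent>yes</indent>
    <indent-untyped>yes</indent-untyped>
  </options>)

return (
  fn:string-length($default),
  fn:string-length($indented)
)
```

The two numbers typically differ by the amount of whitespace the serializer adds, which is why a size figure is only meaningful together with the options used to produce it.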

If by 'size' you mean how much disk / memory the document 'uses' inside MarkLogic, that can be determined statistically: take a snapshot of the disk space used in all data directories, insert a large number of documents, take another snapshot, and divide the difference by the number of documents. It will vary, possibly greatly, depending on many factors such as indexing settings, similarity between documents, merge rates and limits, etc.
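
One possible way to take such a snapshot from inside MarkLogic, instead of inspecting the data directories by hand, is to sum the stand sizes reported by xdmp:forest-status. Treat this as a rough sketch under assumptions: it relies on the disk-size element of the forest status (reported in megabytes, as far as I know), and it measures the current database only.

```xquery
xquery version "1.0-ml";
declare namespace fs = "http://marklogic.com/xdmp/status/forest";

(: Total on-disk stand size across all forests of the current database.
   Assumption: fs:disk-size is reported in MB, so the result is coarse. :)
let $forests := xdmp:database-forests(xdmp:database())
return fn:sum(xdmp:forest-status($forests)//fs:stand/fs:disk-size)
```

Run it once before and once after the bulk insert; the difference divided by the number of inserted documents gives an average per document. Because merges and deleted fragments also change the stand sizes, the figure is only meaningful over a large batch.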

Documents are stored in a highly compressed form, typically much smaller than the text size, but indexing options add to the total size. Both depend greatly on how many terms, tokens, and substrings the different documents share.

If by size you mean how much memory a document will take when accessed, that is even more variable and less easily measurable. It can range from 0x (queries entirely resolved by index) to 10x or more for highly structured documents with little or no text content.

DALDEI
6

The easiest way I've found to calculate the raw document size (before indexes are taken into account) is to convert the document into a binary and use xdmp:binary-size().

Here is an example of how you'd do that:

xdmp:binary-size(xdmp:unquote(xdmp:quote($doc),(),"format-binary")/binary())
Tyler Replogle
  • How can I get the same value in KB or MB? – Aswanikumar Aug 08 '16 at 21:25
  • 3
    xdmp:binary-size returns an integer number of bytes, so all you need to do is convert it to KB or MB. 1 KB is 1024 bytes and 1 MB is 1024 KB, so xdmp:binary-size(.) div 1024 gives you KB and xdmp:binary-size(.) div 1024 div 1024 gives you MB. – Tyler Replogle Aug 10 '16 at 15:21
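
Putting the answer and the comment together, a minimal sketch could look like the following. The URI "/file.xml" is a placeholder and the rounding to two decimals is my own choice, not part of the original answer.

```xquery
xquery version "1.0-ml";

(: "/file.xml" is a placeholder URI. :)
let $doc := fn:doc("/file.xml")

(: Serialize the document, re-parse it as a binary node, and take its size in bytes. :)
let $bytes := xdmp:binary-size(
  xdmp:unquote(xdmp:quote($doc), (), "format-binary")/binary()
)

return (
  fn:concat($bytes, " bytes"),
  fn:concat(fn:round-half-to-even($bytes div 1024, 2), " KB"),
  fn:concat(fn:round-half-to-even($bytes div 1024 div 1024, 2), " MB")
)
```

The byte count reflects the default serialization produced by xdmp:quote, so it is the size of the document as text, not the space it occupies in the database.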