
In XQuery 3.1 (in eXist 4.7) I have 40 XML files, and I need to select 4 of them at random. However, I would like the four files to be different.

My files are all in the same collection ($data). I currently count the files, then use a randomising function (util:random($max as xs:integer)) to generate a position() in the sequence of files, repeating this four times:

let $filecount := count($data)
for $cnt in 1 to 4
let $pos := util:random($filecount)
return $data[position()=$pos]

But this often results in the same files being selected multiple times by chance.

Each file has a distinct @xml:id (in the root node of each file) which can allow me, if possible, to use that as some sort of predicate in recursion. But I'm unable to identify a method for somehow accruing the @xml:ids into a cumulative, recursive sequence.

Thanks for any help.

jbrehr

2 Answers


I think the standardized random-number-generator function and its permute function (https://www.w3.org/TR/xpath-functions/#func-random-number-generator) should give you better "randomness" and more diverse results, e.g.

let $file-count := count($data)
return $data[position() = random-number-generator(current-dateTime())?permute(1 to $file-count)[position() le 4]]

I haven't tried that with your database or XQuery implementation, and there may also be ways to do it with the functions you currently use.

For eXist-db, I guess one strategy is to call the random-number function until you have a sequence of distinct values of the wanted length. The following returns (at least in some tests with eXide) four distinct numbers between 1 and 40 on each call:

declare function local:random-sequence($max as xs:integer, $length as xs:integer) as xs:integer+ {
    local:random-sequence((), $max, $length)
};

declare function local:random-sequence($seq as xs:integer*, $max as xs:integer, $length as xs:integer) as xs:integer+ {
    if (count($seq) = $length and $seq = distinct-values($seq))
    then $seq
    else local:random-sequence((distinct-values($seq), util:random($max)), $max, $length)
};

let $file-count := 40
return local:random-sequence($file-count, 4)

Integrating that in the previous attempt would result in

let $file-count := count($data)
return $data[position() = local:random-sequence($file-count, 4)]

As for your comment: I had not noticed that eXist's util:random function can return 0 and excludes the max value. Based on your comment and a further test, I guess you would rather want the function I posted above to be implemented as

declare function local:random-sequence($seq as xs:integer*, $max as xs:integer, $length as xs:integer) as xs:integer+ {
    if (count($seq) = $length)
    then $seq
    else
        let $new-number := util:random($max + 1)
        return if ($seq = $new-number or $new-number = 0)
               then local:random-sequence($seq, $max, $length)
               else local:random-sequence(($seq, $new-number), $max, $length)
};

That way it hopefully now returns $length distinct values between 1 and the $max argument.
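
One caveat, offered as a precaution rather than something tested against eXist: util:random is nondeterministic, and a predicate is conceptually evaluated once per item of $data, so a processor would be free to draw a fresh random sequence for each item. Binding the positions to a variable first avoids depending on how the predicate is evaluated. A minimal sketch, in which the collection path is hypothetical and only stands in for the asker's $data:

```xquery
xquery version "3.1";

(: Revised helper from above: $length distinct integers between 1 and $max. :)
declare function local:random-sequence($seq as xs:integer*, $max as xs:integer, $length as xs:integer) as xs:integer+ {
    if (count($seq) = $length)
    then $seq
    else
        let $new-number := util:random($max + 1)
        return if ($seq = $new-number or $new-number = 0)
               then local:random-sequence($seq, $max, $length)
               else local:random-sequence(($seq, $new-number), $max, $length)
};

(: Hypothetical path; replace with wherever the 40 files actually live. :)
let $data := collection("/db/apps/myapp/data")
(: Draw the four positions exactly once, outside the predicate. :)
let $positions := local:random-sequence((), count($data), 4)
return $data[position() = $positions]
```

The result comes back in the order of $data, not in the order the positions were drawn, which should not matter for picking four files.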

Martin Honnen
  • That's certainly the way I would do it. – Michael Kay Oct 23 '19 at 07:20
  • Although not something you can answer to, eXist is telling me `Function fn:random-number-generator() is not defined in module namespace: http://www.w3.org/2005/xpath-functions `. This despite it being in eXist documentation: http://www.exist-db.org/exist/apps/fundocs/view.html?uri=http://www.w3.org/2005/xpath-functions&location=java:org.exist.xquery.functions.fn.FnModule&details=true I note this here in case someone else searches on the problem... – jbrehr Oct 23 '19 at 07:32
  • I have tried the online exide web page and could get `let $file-count := 40 return random-number-generator((current-dateTime() - xs:dateTime("1970-01-01T00:00:00-00:00")) div xs:dayTimeDuration('PT1S')) ?permute(1 to $file-count)[position() le 4]` to work in it to return a sequence of four random numbers between 1 and 40. – Martin Honnen Oct 23 '19 at 07:47
  • Indeed. It turns out it's [only available as of eXist 5.0](https://exist-db.org/exist/apps/wiki/blogs/eXist/eXistdb500) (released last month, and is the environment for the public eXide you used). Unfortunately that makes the function effectively unavailable to me. – jbrehr Oct 23 '19 at 08:08
  • @jbrehr, see the edit, unless there is some way an eXist expert can tell you how to ensure its random generators give some better randomness you could try to call the function until distinct-values assures you the result is four (or whatever number you need) different values. – Martin Honnen Oct 23 '19 at 08:50
  • Thanks for that solution, works great! (although it will take me a while to understand how you've used the same function name `random-sequence` twice ) – jbrehr Oct 23 '19 at 09:29
  • @MartinHonnen There is one peculiarity of this function - it can sometimes return a zero in the sequence, which would ultimately return no file in that position. I've tried to add a test for `0` against sequence `$seq` but I've had no success. How would you do that? – jbrehr Oct 23 '19 at 11:09
  • @jbrehr, I have changed the implementation of the function in the answer to exclude any `0` value and also to adapt the function to use the `util:random` function with `$max + 1` so that the maximum value (e.g. `40`) can be returned as well. – Martin Honnen Oct 23 '19 at 11:25
  • Shouldn't `let $new-number := util:random($max) + 1` do the job without having to check for zero? – line-o Oct 25 '19 at 16:07
  • @line-o, very good point, yes, with all the changes made from my original attempt to simply rely on the XPath 3.1 way to finally get something that works with the original poster's version of eXist I overlooked that simple change in the use of the eXist function. – Martin Honnen Oct 25 '19 at 16:32

It was such a fun question and interesting answer that I could not help but play with local:random-sequence. Here is what I came up with:

(: needs zero-check, would return 1 item otherwise :)
declare function local:random-sequence($max as xs:integer, $length as xs:integer) as xs:integer* {
    if ($length = 0)
    then ()
    else local:random-sequence((), $max, $length)
};

declare function local:random-sequence($seq as xs:integer*, $max as xs:integer, $length as xs:integer) as xs:integer+ {
    let $new-number := util:random($max) + 1
    let $new-seq :=
        if ($seq = $new-number)
        then $seq
        else ($seq, $new-number)

    return
        if (count($new-seq) >= $length)
        then $new-seq
        else local:random-sequence($new-seq, $max, $length)
};

I think it is a little easier to read and grasp. It also saves 1 function call ;)
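
One thing to keep in mind with any of these retry-until-distinct versions: if $length is greater than $max, there are not enough distinct values between 1 and $max, so the recursion can never reach its stopping condition. A minimal sketch of a guarded entry point follows; the function name local:random-sequence-checked and the error name local:too-many are made up for illustration and are not part of eXist:

```xquery
xquery version "3.1";

(: Hypothetical guarded wrapper: refuses requests that cannot be satisfied. :)
declare function local:random-sequence-checked($max as xs:integer, $length as xs:integer) as xs:integer* {
    if ($length le 0)
    then ()
    else if ($length gt $max)
    then error(xs:QName("local:too-many"),
               "cannot draw " || $length || " distinct values from 1 to " || $max)
    else local:random-sequence((), $max, $length)
};

(: Three-argument helper as given above. :)
declare function local:random-sequence($seq as xs:integer*, $max as xs:integer, $length as xs:integer) as xs:integer+ {
    let $new-number := util:random($max) + 1
    let $new-seq :=
        if ($seq = $new-number)
        then $seq
        else ($seq, $new-number)

    return
        if (count($new-seq) >= $length)
        then $new-seq
        else local:random-sequence($new-seq, $max, $length)
};

(: Four distinct positions between 1 and 40. :)
local:random-sequence-checked(40, 4)
```

When $length is close to $max, most draws are rejected as duplicates, so the number of retries grows; for 4 out of 40 that is unlikely to matter.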

line-o