
We often need to extract random samples from a large dataset. What is the best way to do this in OpenRefine? This might be useful for practitioners who are used to doing it in R or Python.

Thanks in advance for any advice!

Neil Lunn
Joni Hoppen

1 Answer


OpenRefine has no built-in function for that, but you can use Python/Jython to create a new column of random integers. For example, if you have 100,000 rows:

import random
return random.randint(0, 100000)

Then you can sort on this column, reorder the rows permanently, and select, for example, the first thousand with a custom text facet:

row.index < 1000
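
For readers coming from Python, the same idea (give every row a random key, sort on it, keep the first n rows) can be sketched outside OpenRefine as below. This is only an illustration of the technique, not OpenRefine code: `sample_rows` and its `seed` parameter are hypothetical names made up for this example, and it uses a random float key rather than an integer so that ties are unlikely.

```python
import random


def sample_rows(rows, n, seed=None):
    """Return n rows picked at random without replacement.

    Hypothetical helper mirroring the OpenRefine recipe:
    random key column -> sort -> keep the first n rows.
    """
    rng = random.Random(seed)  # seed is optional, only for reproducibility
    # Step 1: attach a random key to every row (the "new column").
    keyed = [(rng.random(), index, row) for index, row in enumerate(rows)]
    # Step 2: sort on the key; the original index breaks any ties.
    keyed.sort(key=lambda item: (item[0], item[1]))
    # Step 3: keep the first n rows (the "row.index < n" facet).
    return [row for _, _, row in keyed[:n]]


rows = ["row %d" % i for i in range(100000)]
subset = sample_rows(rows, 1000, seed=42)
print(len(subset))
```

Because every row gets its own key and the rows are only reordered, no row can appear twice in the sample.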

EDIT: I forgot that this extension from @OwenStephens adds a randomNumber GREL function. Feel free to install it.


Ettore Rizza
  • That helps a lot. Thanks once again! Ettore! – Joni Hoppen Sep 06 '17 at 03:10
  • You're welcome. Answer edited by the way. Note: If you have specific questions about OpenRefine, you can also ask them on the dedicated [Google group](https://groups.google.com/forum/#!forum/openrefine). – Ettore Rizza Sep 06 '17 at 03:39