
I need to generate large numbers of deterministic UUIDs. This works great for small numbers of values using the pattern:

uuidgen --sha1 -n "${namespace}" -N "${name}"

However, this is too slow for generating hundreds of thousands of them at a time. I can write this in something else, but there are other reasons why I'd like to use bash unless it's truly the wrong tool for the job. It is essential for the application that the UUIDs are the same regardless of when and where they are generated, so changing the algorithm is not an option.

Is there a way this can be done "reasonably" efficiently via bash or do I really need to use a different tool?

Kyle Banerjee
  • If you care about performance, you probably shouldn't be using a scripting language in the first place... – AgainPsychoX Mar 22 '22 at 15:42
  • Performance is not typically a concern, and for many other processes, scripting can be quite fast -- especially for the awk routines where we expect tens of thousands of records per second. uuidgen is particularly slow – Kyle Banerjee Mar 22 '22 at 15:43
  • Have you tried running it in parallel? – AgainPsychoX Mar 22 '22 at 15:44
  • each `uuidgen` call (in `bash`) requires a new OS process to be created, processed and then discarded; doing this 100's of thousands of times ... is going to take a long time; while you could certainly put an `awk` wrapper around it you won't get much/any benefit since `awk` will need to invoke a new `uuidgen` instance each time (ie, `awk` is making an external OS call to `uuidgen`); you'll likely have better luck (and performance?) with a language (java? python?) that has a uuid generator/module – markp-fuso Mar 22 '22 at 15:47
  • if you need the ability to (re)generate the same uuid's, any chance of generating them once and then storing them in a database/datastore for future reference? – markp-fuso Mar 22 '22 at 15:51
  • I hadn't tried parallel -- but a quick experiment is showing a significant boost. This particular routine isn't a system call from awk, but it is a subshell. Python is my fallback -- I know it will cut through this like butter. Part of the reason I didn't do that from the beginning has to do with environments on other machines where this will be run – Kyle Banerjee Mar 22 '22 at 15:51
  • @markp-fuso That won't work for this particular application, but you make me realize that for reasons specific to this project I can leverage output from another data source to improve performance by three orders of magnitude – Kyle Banerjee Mar 22 '22 at 15:54
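
The Python fallback mentioned in the comments might look roughly like the sketch below. It generates all the UUIDs inside one process, which avoids the per-call process creation that makes a `uuidgen` loop slow. The function name `uuid5_for_names` is hypothetical. The sketch assumes that `uuidgen --sha1` and Python's `uuid.uuid5` both implement the RFC 4122 version 5 (SHA-1, name-based) scheme, so the same namespace UUID and name should give the same result; it is worth checking a handful of values against existing `uuidgen` output before relying on it.

```python
import uuid


def uuid5_for_names(namespace, names):
    """Yield (name, UUID) pairs for every name, all in a single process.

    `namespace` may be a uuid.UUID or its string form. The uuidgen aliases
    @dns, @url, @oid and @x500 correspond to uuid.NAMESPACE_DNS,
    uuid.NAMESPACE_URL, uuid.NAMESPACE_OID and uuid.NAMESPACE_X500.
    """
    ns = namespace if isinstance(namespace, uuid.UUID) else uuid.UUID(namespace)
    for name in names:
        # uuid5 hashes the namespace bytes plus the UTF-8 encoded name with
        # SHA-1. Assumption: this matches `uuidgen --sha1 -n NS -N NAME`.
        yield name, uuid.uuid5(ns, name)


# Small demonstration with the standard DNS namespace.
for name, value in uuid5_for_names(uuid.NAMESPACE_DNS, ["python.org", "example.com"]):
    print(f"{value}\t{name}")
```

Because the result is a generator, names can be streamed from a file or from standard input without holding hundreds of thousands of them in memory, and the output can be piped back into the surrounding bash or awk pipeline.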
