3

I have data where each row is a person. I want to make a randomly generated unique ID, so I can identify them in analysis.

Here is a sample dataframe

df <- data.frame(
  gender = rep(c("M", "F", "M", "M", "F"), 10000),
  qtr = sample(c(1:99), 50000, replace = TRUE),
  result = sample(c(100:1000), 50000, replace = TRUE)
)

To generate a unique ID, I am using stringi

library(stringi)
library(magrittr)
library(dplyr)  # for mutate()

df <- df %>%
  mutate(UniqueID = do.call(paste0, Map(stri_rand_strings, n=50000, length=c(2, 6),
                                        pattern = c('[A-Z]', '[0-9]'))))

However, when I test to see if the new variable UniqueID is unique, by running this code, I find there are some duplicates.

length(unique(unlist(df[c("UniqueID")])))

Is there a way to generate a unique ID which is truly unique, with no duplicates?

I have seen these questions, but they don't answer how to make the generated random numbers unique: "Generating unique random numbers in dataframe column in R" and "Create a dataframe with random numbers in each column".

Thanks

Laura

3 Answers

8

You can use the ids package to create unique IDs automatically. For instance, to make 1 million user IDs, you could use:

randos <- ids::random_id(1E6, 4)
# The 2nd term here controls how many bytes are assigned to each ID.
# The default, 16 bytes, makes much longer IDs and crashes my computer

head(randos)
#[1] "31ca372d" "d462e55f" "2374cc78" "15511574" "ecbf2d65" "236cb2d3"

It has other nice features, like the adjective_animal function, which creates IDs that are easier for humans to distinguish and remember.

creatures <- ids::adjective_animal(1E6, n_adjectives = 1)
head(creatures)
#[1] "yestern_lizard"          "insensible_purplemarten"
#[3] "cubical_anhinga"         "theophilic_beaver"      
#[5] "subzero_greyhounddog"    "hurt_weasel"   
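
Random IDs of a fixed length are not guaranteed to be unique, and shorter IDs collide more often. As a rough sketch, the chance of at least one duplicate can be estimated with the birthday-problem formula. This assumes every ID is drawn uniformly and independently from the possible values; `collision_prob` is a hypothetical helper written for illustration, not a function from the ids package.

```r
# Hypothetical helper (not part of the ids package): probability of at least
# one duplicate among n IDs drawn uniformly from `space` possible values.
collision_prob <- function(n, space) {
  # P(no duplicate) = prod over k = 0..n-1 of (1 - k / space);
  # sum the logs instead of multiplying to stay numerically stable.
  log_p_unique <- sum(log1p(-(seq_len(n) - 1) / space))
  -expm1(log_p_unique)
}

space_4_bytes <- 256^4  # 4 random bytes give 2^32 possible IDs

collision_prob(1e4, space_4_bytes)  # roughly 0.01
collision_prob(1e6, space_4_bytes)  # very close to 1
```

Under that assumption, using more bytes per ID is the simplest way to push the probability down, and `anyDuplicated()` on the result confirms whether a particular draw happened to be unique.
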
Jon Spring
  • Hmm for some reason with your first command and 10,000 IDs I had one duplicate. I ran `name <- ids::random_id(n, 4)` and got the value `519299b2` twice – Patrick Coulombe Aug 25 '21 at 18:58
  • Thanks, I didn't realize they were not reliably unique. Here's a related question with some alternate suggestions to ensure / increase chances of uniqueness: https://stackoverflow.com/a/64139202/6851825 – Jon Spring Aug 26 '21 at 00:24
  • Thanks, I did just that (repeat() with a break when there's no duplicates). In fact I think I was just unlucky, as I couldn't reproduce it afterwards... oh well. – Patrick Coulombe Aug 26 '21 at 00:30
2

It may not be what you want, but using your own script you can generate a larger vector of random strings (say 60,000) and then sample as many of the unique ones as you need (50,000):

df <- df %>%
  mutate(UniqueID = sample( unique(do.call(paste0, 
                                           Map(stri_rand_strings, n=60000, length=c(2, 6),
                                           pattern = c('[A-Z]', '[0-9]')))), 50000) ) 

length(unique(unlist(df[c("UniqueID")])))
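
A variation on the same idea, sketched here in base R, is to keep drawing until enough distinct IDs have been collected, so there is no need to guess how many extra strings to generate. Both `make_unique_ids` and `letter_digit_id` are hypothetical helpers for illustration; any generator that takes a count and returns a character vector (for example one built on `stri_rand_strings`) could be passed in instead.

```r
# Hypothetical helper: keep drawing until n distinct IDs are collected.
make_unique_ids <- function(n, generator) {
  ids <- character(0)
  while (length(ids) < n) {
    # Draw only as many as are still missing, then drop any duplicates.
    ids <- unique(c(ids, generator(n - length(ids))))
  }
  ids
}

# Illustrative generator: 2 capital letters followed by 6 digits.
letter_digit_id <- function(n) {
  letters_part <- paste0(sample(LETTERS, n, replace = TRUE),
                         sample(LETTERS, n, replace = TRUE))
  digits_part <- sprintf("%06d", sample.int(1e6, n, replace = TRUE) - 1L)
  paste0(letters_part, digits_part)
}

set.seed(42)
ids <- make_unique_ids(50000, letter_digit_id)
anyDuplicated(ids)  # 0 means there are no duplicates
```

Note that the loop never finishes if `n` is larger than the number of distinct IDs the generator can produce, so the ID format needs comfortably more possible values than rows.
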
Majid
1

Generating random strings can produce duplicates. One thing we can do is make the rules for the random strings complex enough that the probability of a duplicate becomes very small. For example, combine two random strings to make a unique ID:

library(stringi)
# Generate one ID per row; a shorter vector would be recycled and repeat IDs
df$UniqueID <- paste0(stri_rand_strings(nrow(df), 2, '[A-Z]'), 
                      stri_rand_strings(nrow(df), 6, '[0-9]'))

The larger the space of possible IDs (here 26^2 × 10^6 = 676 million), the less likely a duplicate becomes, although it is never ruled out entirely. You can try other combinations of the length and pattern arguments to make duplicates less likely.
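
If duplicates must be ruled out rather than made unlikely, one option is to sample distinct integers without replacement and convert each one to the two-letter, six-digit format. The following is a minimal sketch in base R, not a stringi feature; `n_ids` and the intermediate variable names are illustrative. Because every integer maps to exactly one string, distinct integers always give distinct IDs.

```r
set.seed(1)
n_ids <- 50000

# Number of possible IDs: 26 * 26 letter pairs times 10^6 digit strings.
id_space <- 26^2 * 1e6

# Distinct integers in 0 .. id_space - 1 (sampling without replacement).
idx <- sample.int(id_space, n_ids) - 1L

# Decode each integer: the low six decimal digits form the numeric part,
# the remainder (0..675) selects the two letters.
digit_part  <- idx %% 1000000L
letter_part <- idx %/% 1000000L
first_letter  <- LETTERS[letter_part %/% 26L + 1L]
second_letter <- LETTERS[letter_part %% 26L + 1L]

unique_id <- paste0(first_letter, second_letter, sprintf("%06d", digit_part))

anyDuplicated(unique_id)  # 0 means there are no duplicates
```

The result can be assigned to the data frame with `df$UniqueID <- unique_id`, provided `n_ids` equals `nrow(df)`.
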

Ronak Shah