Replace letters in a template string in R

Question

Given a "template" of a UK postcode, such as "A9 9AA", where "A" is a letter placeholder, and "9" is a number placeholder, I want to generate random postcode strings like "H8 4GB". Letters can be any uppercase letter, and numbers anything from 0 to 9.

So if the template is "AA9A 9AA" then I want strings like "WC1A 9LK". I'm ignoring for now generating "real" postcodes, so I'm not bothered if "WC1A" is a valid outward code.

I've scraped around trying to get functions from the stringi package to work, but the problem seems to be that replacing or matching the "A"s in a template ends up using only the first replacement for every "A", for example:

 stri_replace_all_fixed("A9 9AA",c("A","A","A"), c("X","Y","Z"), vectorize_all=FALSE)
[1] "X9 9XX"

so it doesn't replace each "A" with each element from the replacement vector (but this is by design).

Maybe there's something in stringi or base R that I've missed - I'd like to keep it in those packages so I don't bloat what I'm working on.

The brute-force method is to split the template, do replacements, paste the result back together but I'd like to see if there's a quicker, naturally vectorised solution.

So to summarise:

foo("A9 9AA") # return like "B6 5DE"
foo(c("A9 9AA","A9 9AA","A9A 9AA")) # returns c("Y6 5TH","D4 8JH","W0Z 3KQ")

Here's a non-vectorised version which relies on constructing an expression and evaluating it...

random_pc <- function(fmt){
    cc = gsub(" ",'c(" ")',gsub("9","sample(0:9,1)",gsub("A","sample(LETTERS,1)",strsplit(fmt,"")[[1]])))
    paste(eval(parse(text=paste0("c(",paste(cc,collapse=","),")"))),collapse="")    
}

> random_pc("AA9 9AA")
[1] "KO6 1AY"

I assume it is important that it is an arbitrary template, and that you're not just looking for something to generate a pattern? — Andreas Storvik Strauman, Apr 06 '18 at 17:27
There's a small number of possible formats - see https://en.wikipedia.org/wiki/Postcodes_in_the_United_Kingdom - but I might also want to generate bad ones for testing. — Spacedman, Apr 06 '18 at 17:42
And I suspect using `lapply` to iterate over the template counts as a "non-quick, non-natural vectorised" brute-force method? — Andreas Storvik Strauman, Apr 06 '18 at 17:51

MKR · Answer 1 · 2018-04-06T18:48:26.740

4

As I understand it, the OP wants to randomly generate UK postcodes in a specified format. I think sprintf can help, for example:

sprintf("%s%s %d%d%s", sample(LETTERS,1),sample(LETTERS,1), sample(0:9,1),
                sample(0:9,1), sample(LETTERS,1) )
#[1] "BC 59D"

Now, if the purpose is to supply the format using 9 and A, then the first step is to replace 9 with %d and A with %s.
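A minimal sketch of that replacement step, in base R only. The function name random_from_template is hypothetical (it is not from the answer above), and it assumes a single template string in which "A" and "9" are the only placeholders; any other character, including a literal "%", is copied through unchanged.

```r
# Hypothetical helper: convert an "A"/"9" template into a sprintf() format
# and fill it with one random value per placeholder.
random_from_template <- function(template) {
  chars <- strsplit(template, "", fixed = TRUE)[[1]]

  # Escape literal "%" first, then swap the placeholders for conversions
  fmt <- gsub("%", "%%", template, fixed = TRUE)
  fmt <- gsub("A", "%s", fmt, fixed = TRUE)
  fmt <- gsub("9", "%d", fmt, fixed = TRUE)

  # One random argument per placeholder, in template order
  args <- lapply(chars[chars %in% c("A", "9")], function(ch) {
    if (ch == "A") sample(LETTERS, 1) else sample(0:9, 1)
  })

  do.call(sprintf, c(list(fmt), args))
}

random_from_template("AA9 9AA")
```

Note that sprintf accepts a limited number of arguments, so this only suits short templates such as postcodes.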

OPTION#2

Another option is to use paste0 and sapply with a custom function. The function has to be defined before it is called:

#custom function to generate random characters
getCodeText <- function(x){
  retVal = x
  for(i in seq_along(x)){
    if(x[i] == "A"){
      retVal[i] = sample(LETTERS,1)
    }else if(x[i] == "9"){
      retVal[i] = as.character(sample(0:9,1))
    }
  }
  retVal
}

#works on a single template string
fmt <- "AA 9AA A"
paste0(sapply(strsplit(fmt,""), getCodeText), collapse = "")
#"YF 7OP Z"

edited Apr 06 '18 at 18:48

answered Apr 06 '18 at 18:21

I've edited my q to show an answer based on constructing and evaluating an expression like your first option. – Spacedman Apr 06 '18 at 20:57
@Spacedman That looks good. Actually I was working on something similar for my option#2. Finally realized that `paste0` with custom function can be quicker. – MKR Apr 06 '18 at 21:07

score 1 · Answer 2 · answered Apr 07 '18 at 18:13

Here's a solution (vectorised the lazy way) that splits the format and then replaces based on character or numeric:

randpc <- Vectorize(function(s){
    s = strsplit(s,"")[[1]]
    NUMS = as.character(0:9)
    nLet = sum(s %in% LETTERS)
    nDig = sum(s %in% NUMS)
    s[s %in% LETTERS] = sample(LETTERS, nLet, replace=TRUE)
    s[s %in% NUMS] = sample(NUMS, nDig, replace=TRUE)
    paste0(s, collapse="")
})

This has the useful side effect of returning a named vector that shows the format string:

> randpc(c("AA9 9AA","A9 9AA"))
  AA9 9AA    A9 9AA 
"QS4 4LW"  "S9 7EU"

It's also flexible in that it can create postcodes based on another postcode, since it accepts any uppercase letter or digit in the format string:

> randpc(rep("LA1 4YF",3))
  LA1 4YF   LA1 4YF   LA1 4YF 
"OL2 5OJ" "YK3 3YB" "FV0 1LW"

Good efforts!! I think it will be worth finding efficiency of these different answers. — MKR, Apr 07 '18 at 18:28
This solution is about 10x faster than the `eval`-based one in my Q. — Spacedman, Apr 09 '18 at 08:22
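For comparison, here is a sketch of the same split-and-replace idea done without Vectorize: all templates are flattened into one character vector, every placeholder is sampled in a single call, and the pieces are regrouped. The name random_pc_vec is hypothetical, and the sketch assumes "A" and "9" are the only placeholders and that the input is a non-empty character vector with no NA values. It has not been benchmarked against the answers here.

```r
# Hypothetical helper: replace placeholders across all templates at once.
random_pc_vec <- function(templates) {
  chars <- strsplit(templates, "", fixed = TRUE)
  flat <- unlist(chars)

  # Sample every letter and every digit in one call each
  is_letter <- flat == "A"
  is_digit <- flat == "9"
  flat[is_letter] <- sample(LETTERS, sum(is_letter), replace = TRUE)
  flat[is_digit] <- sample(as.character(0:9), sum(is_digit), replace = TRUE)

  # Regroup the characters by the template they came from
  groups <- factor(rep(seq_along(chars), lengths(chars)),
                   levels = seq_along(chars))
  unname(vapply(split(flat, groups), paste0, character(1), collapse = ""))
}

random_pc_vec(c("A9 9AA", "A9 9AA", "A9A 9AA"))
```

Unlike the Vectorize version, this returns an unnamed vector, so the format strings are not carried along with the result.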

score 0 · Answer 3 · answered Apr 06 '18 at 21:32

I am not sure what counts as brute force, since a split-replace-combine workflow on the strings seemed the most intuitive to me. However, my first attempts were pretty slow with very large numbers of templates. I had also hoped something like stri_replace_all(replacement = sample(LETTERS, 1)) would work but it also only replaces with the same letter.

This is a slightly different approach using stri_replace_first, replacing the first instance of a template character until there are no template characters left. This means I switch the template to be lowercase l for letters and n for numbers, since postcodes are uppercase letters and numbers only (as far as I know). I think the running time is a lot more reasonable (~10 secs) for 100k templates and this also only uses stringi.

library(stringi)

make_postcodes <- function(templates){
  postcodes <- templates
  while (any(stri_detect_regex(postcodes, "l|n"))){
    for (i in 1:length(templates)){
      postcodes[i] <- stri_replace_first_fixed(
        str = postcodes[i],
        pattern = "l",
        replacement = sample(LETTERS, 1)
        )
      postcodes[i] <- stri_replace_first_fixed(
        str = postcodes[i],
        pattern = "n",
        replacement = sample(0:9, 1)
        )
    }
  }
  postcodes
}

make_postcodes("ln nll")
#> [1] "W8 3MX"
make_postcodes(c("ln nll", "ln nll", "lnl nll"))
#> [1] "H1 6TN"  "C5 6YI"  "A3I 2DB"

test = stri_trim_both(stri_rand_strings(100000, sample(5:9, 1), pattern = "[nl\\ ]"))
tictoc::tic("Time to convert 100,000 templates")
x <- make_postcodes(test)
tictoc::toc()
#> Time to convert 100,000 templates: 12.03 sec elapsed
head(test)
#> [1] "lnnl"  "ll l"  "nl n"  "ll  l" "ll l"  "ll n"
head(x)
#> [1] "G91U"  "HU N"  "2Q 7"  "EU  Z" "PD I"  "SM 4"

Created on 2018-04-06 by the reprex package (v0.2.0).
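Since stri_replace_first_fixed is vectorised over str and replacement, the inner for loop can probably be dropped. The following is a sketch under that assumption, using the same lowercase l/n template convention; the name make_postcodes_vec is hypothetical, the input is assumed to contain no NA values, and I have not timed it against the version above.

```r
library(stringi)

# Hypothetical variant: one vectorised replacement per pass, no inner loop.
make_postcodes_vec <- function(templates) {
  postcodes <- templates
  n <- length(postcodes)
  # Each pass fills the first remaining "l" and "n" in every string
  while (any(stri_detect_regex(postcodes, "[ln]"))) {
    postcodes <- stri_replace_first_fixed(
      postcodes, "l", sample(LETTERS, n, replace = TRUE)
    )
    postcodes <- stri_replace_first_fixed(
      postcodes, "n", sample(as.character(0:9), n, replace = TRUE)
    )
  }
  postcodes
}

make_postcodes_vec(c("ln nll", "ln nll", "lnl nll"))
```

The number of passes is bounded by the largest count of a single placeholder type in any template, so for postcode-length templates the loop runs only a handful of times regardless of how many templates there are.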
