3

I have a set of UK postcodes which need to be reformatted. They are made up of an incode and an outcode, where incode is of the form 'number letter letter' e.g. 2DB and the outcode is a combination of between 2 and 4 letters and numbers e.g. NW1 or SW10 or EC1A

Currently there is one space between the incode and outcode, but I need to reformat these so that the full postcode is 7 characters long e.g: ('-' stands for space)

  • NW1-2DB -> NW1-2DB (1 space between outcode and incode)
  • SW10-9NH -> SW109NH (0 spaces)
  • E1-6QL -> E1--6QL (2 spaces)

Data:

df <- data.frame("postcode"=c("NW1 2DB","SW10 9NH","E1 6QL"))
df
#   postcode
# 1  NW1 2DB
# 2 SW10 9NH
# 3   E1 6QL

I have written a regex string to separate the outcode and incode, but couldn't find a way to add a variable number of spaces between them (this example just creates two spaces between outcode and incode).

require(dplyr)
df <- df %>% mutate(postcode_2sp = gsub('?(\\S+)\\s*?(\\d\\w{2})$','\\1  \\2', postcode)

To get around that I've tried to use mutate(),nchar() and rep():

df<-df %>% 
  mutate(outcode=gsub('?(\\S+)\\s*\\d\\w{2}$','\\1',postcode),
         incode=gsub('\\S+\\s*?(\\d\\w{2})$','\\1',postcode)) %>%
  mutate(out_length=nchar(outcode))%>%
  mutate(postcode7=paste0(outcode,
                          paste0(rep(" ",4-out_length),collapse=""),
                          incode))

but get this error:

Error: invalid 'times' argument

without the last step to create postcode7 the df looks as follows:

df
#   postcode outcode incode out_length 
# 1  NW1 2DB     NW1    2DB          3  
# 2 SW10 9NH    SW10    9NH          4 
# 3   E1 6QL      E1    6QL          2 

And if I set the rep 'times' argument to a constant the code runs as expected (but doesn't do what I need it to do!)

df<-df %>% 
  mutate(outcode=gsub('?(\\S+)\\s*\\d\\w{2}$','\\1',postcode),
         incode=gsub('\\S+\\s*?(\\d\\w{2})$','\\1',postcode)) %>%
  mutate(out_length=nchar(outcode))%>%
  mutate(postcode7=paste0(outcode,
                          paste0(rep(" ",4),collapse=""),
                          incode))
df
#   postcode outcode incode out_length   postcode7
# 1  NW1 2DB     NW1    2DB          3  NW1    2DB
# 2 SW10 9NH    SW10    9NH          4 SW10    9NH
# 3   E1 6QL      E1    6QL          2   E1    6QL

Is there a way to make rep() accept a column as the times argument in a mutate? Or should I be looking at a totally different approach?

EDIT: I've just realised that I can use an if statement for each case of 2 characters, 3 characters or 4 characters in the outcode but that doesn't feel very elegant.

Psidom
  • 209,562
  • 33
  • 339
  • 356
lapsel
  • 75
  • 5
  • Do you have to use regular expression to split your post code? What's the problem with `strsplit`? – Psidom Jun 29 '16 at 16:59
  • @Psidom By default, `strsplit` also uses a regular expression — but the problem is that `strsplit` needs *something* to split on. If you look at OP’s regex, you’ll see that the space in the middle is entirely optional in the input. – Konrad Rudolph Jun 29 '16 at 17:14
  • You are right, `strsplit` works as long as there is a space between the incode and outcode (as I have specified), but Konrad is right in that postcodes aren't always formatted like that. My bad for being too specific in the question. – lapsel Jun 29 '16 at 17:14
  • @KonradRudolph OK. I was reading OP's statement as *Currently there is one space between the incode and outcode*. Didn't look at the regex expression very carefully. That makes sense. – Psidom Jun 29 '16 at 17:17

4 Answers4

4

Have a look at the str_pad method from stringr package, which is suited for your case:

library(stringr)
df<-df %>% 
    mutate(outcode=gsub('?(\\S+)\\s*\\d\\w{2}$','\\1',postcode),
           incode=gsub('\\S+\\s*?(\\d\\w{2})$','\\1',postcode)) %>%
    mutate(out_length=nchar(outcode)) %>% 
    mutate(postcode7 = paste(outcode, str_pad(incode, 7-out_length), sep = ""))

df
#   postcode outcode incode out_length postcode7
# 1  NW1 2DB     NW1    2DB          3   NW1 2DB
# 2 SW10 9NH    SW10    9NH          4   SW109NH
# 3   E1 6QL      E1    6QL          2   E1  6QL
Psidom
  • 209,562
  • 33
  • 339
  • 356
3

Another solution, using sprintf to format the output, and tidyr::extract for matching. This has the advantage of drastically simplifying both the pattern and the code for padding:

df %>%
    extract(postcode, into = c('out', 'in'), '(\\S{2,4})\\s*(\\d\\w\\w)') %>%
    mutate(postcode = sprintf('% -4s%s', out, `in`))

I do like the separate version posted above, but it requires that the postcodes are all separated by whitespace. In my experience this generally isn’t the case.

Konrad Rudolph
  • 530,221
  • 131
  • 937
  • 1,214
2

Using str_pad and separate:

library(dplyr)
library(tidyr)
library(stringr)

df %>% 
  separate(postcode, into = c("incode", "outcode"), remove = FALSE) %>% 
  mutate(
    postcode8 = paste0(incode,
                       str_pad(outcode,
                               8 - nchar(incode), side = "left", pad = " ")))

#   postcode incode outcode postcode8
# 1  NW1 2DB    NW1     2DB  NW1  2DB
# 2 SW10 9NH   SW10     9NH  SW10 9NH
# 3   E1 6QL     E1     6QL  E1   6QL
zx8754
  • 52,746
  • 12
  • 114
  • 209
1
df%>%mutate(Postcode7=paste0(format(gsub('\\s.*$','',postcode),justify='left'),
                        format(gsub('^\\S+\\s','',postcode),justify='right')))
Shenglin Chen
  • 4,504
  • 11
  • 11
  • This code doesn’t work in general, it requires (a) that there’s at least one space separating the postcode groups, and (b) that the data frame contains at least one postcode that’s already correctly formatted; otherwise the justification won’t work. – Konrad Rudolph Jun 30 '16 at 09:27