Convert character column to factor preserving column label

Question

I have a dataframe that I read from the XLSX file. Every column name looks like this: CODE___DESCRIPTION so for example A1___Some funky column here. It is easier to use the codes as colnames but I want to use description when needed so it must be stored in the dataframe. This is why I am using sjlabelled package later on.

Make yourself some random data and save it as some_data.xlsx.

library(dplyr) #to play with tibbles
library(stringi) #to play with strings
library(writexl) #name speaks for itself

tibble(col1 = sample(c("a", "b", "c", NA, "N/A"), 50, replace = T),
       col2 = sample(c("d", "e", "f", NA, "N/A"), 50, replace = T),
       col3 = sample(c("g", "h", "i", NA, "N/A"), 50, replace = T),
       col4 = sample(c("j", "k", "l", NA, "N/A"), 50, replace = T)) %>%
  setNames(stri_c("A", 1:4, "___", stri_rand_strings(4, 10))) %>%
  write_xlsx(path = "some_data.xlsx", col_names = T, format_headers = F)

I've created simple function to prepare my data the way I want it.

library(sjlabelled) #to play with labelled data

label_it <- function(data = NULL, split = "___"){

#This basically makes an array of two columns (of codes and descriptions respectively)

  k.n <- data %>%
    names() %>%
    stri_split_fixed(pattern = split, simplify = T)

  data%>%
    set_label(k.n[,2]) %>% #set description as each column's label
    setNames(k.n[,1]) #set code as each column's name

}

First I read the data from XLSX file. Then I label it.

library(readxl) #name speaks for itself again

data <- read_xlsx("some_data.xlsx", na = c("", "N/A")) %>% 
   label_it()

Now each of my dataframe's column is character vector (in fact it's a structure) with two attributes:

label being description part
names being the original dataframe column name (CODE___DESCRIPTION style) and is not to be mistaken for output of names(data) which would be the codes part

Let's say I would like to change first and third column to factor.

To do this I have tried two things:

data[,1] <- factor(data[,1], levels = c("c", "a", "b"))
data[,3] <- factor(data[,3], levels = c("h", "g", "i"))

this changes all of those two columns values to NA_integer_.

data <- data %>%
  mutate(A1 = factor(A1, levels = c("c", "a", "b")),
         A3 = factor(A3, levels = c("h", "g", "i")))

this changes character vectors to factors as intended, but it drops both column attributes (label and names) which I need to be preserved.

I also tried quite a lot of functions from sjlabelled, labelled and haven packages. Nothing worked as I intended. Finally, I have found a solution, but it isn't perfect and I would love to find an easier way of doing this.

The solution is to lose those attributes but then regain ('copy' in fact) them.

data <- data %>%
  mutate(A1 = factor(A1, levels = c("c", "a", "b")),
         A3 = factor(A3, levels = c("h", "g", "i"))) %>%
  copy_labels(data)

copy_labels is function from sjlabelled package which is used when labels are lost due to e.g. data subsetting as in this example.

P.S. I would love to add r-sjlabelled and r-labelled tags because those packages are considered in this problem but am under 1500 reputation required to do this.

I don't see them lost. Can you check `data %>% mutate(A1 = factor(A1, levels = c("c", "a", "b")), A3 = factor(A3, levels = c("h", "g", "i"))) %>% str` ? — Ronak Shah, Jun 18 '19 at 04:36
@RonakShah result of str() is Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 50 obs. of 4 variables: $ A1: Factor w/ 3 levels "c","a","b": 3 NA NA NA 1 1 2 NA NA NA ... $ A2: chr "f" NA NA "f" ... ..- attr(*, "label")= Named chr "CYiJDKaPH6" .. ..- attr(*, "names")= chr "A2___CYiJDKaPH6" $ A3: Factor w/ 3 levels "h","g","i": 3 1 3 2 NA NA 2 1 3 2 ... $ A4: chr NA NA NA NA ... ..- attr(*, "label")= Named chr "zNDs9fkddw" .. ..- attr(*, "names")= chr "A4___zNDs9fkddw" — Matiasko, Jun 18 '19 at 06:14
so you do have `attributes` `label` in there right? Check `lapply(data, attributes)` ? — Ronak Shah, Jun 18 '19 at 06:26
@RonakShah No, for mutated columns (A1, A3) they are changed to factors with *levels* parameter set. Both those columns lose both attributes I want to preserve (*label* and *names*). What I want is the result of what I proposed in the last code chunk. That is `data %>% mutate(A1 = factor(A1, levels = c("c", "a", "b")), A3 = factor(A3, levels = c("h", "g", "i"))) %>% copy_labels(data) %>% str` giving result as — Matiasko, Jun 18 '19 at 06:50
@RonakShah (without A4 because comment would be too long) Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 50 obs. of 4 variables: $ A1: Factor w/ 3 levels "c","a","b": 3 NA NA NA 1 1 2 NA NA NA ... ..- attr(*, "label")= Named chr "BkmIPC7szU" .. ..- attr(*, "names")= chr "A1___BkmIPC7szU" $ A2: chr "f" NA NA "f" ... ..- attr(*, "label")= Named chr "CYiJDKaPH6" .. ..- attr(*, "names")= chr "A2___CYiJDKaPH6" $ A3: Factor w/ 3 levels "h","g","i": 3 1 3 2 NA NA 2 1 3 2 ... ..- attr(*, "label")= Named chr "WeCzTXDoTy" .. ..- attr(*, "names")= chr "A3___WeCzTXDoTy" — Matiasko, Jun 18 '19 at 06:51

Convert character column to factor preserving column label

0 Answers0