Automating the process of recoding numeric variables to meaningful factor variables

Question

I have a large data frame (hundreds of variables wide) in which all values of categorical variables are saved as numerics, for example, 1, 2, 8, representing no, yes, and unknown. However, this is not always consistent. There are variables that have ten or more categories with 88 representing unknown etc.

data <- data.frame("ID" = c(1:5),
                   "Var1" = c(2,2,8,1,8),
                   "Var2" = c(5,8,4,88,10))

For each variable, I do have all information on which value represents which category. Currently, I have this information stored in vectors that are each correctly ordered, like

> Var1_values
[1] 8 2 1

with a corresponding vector containing the categories:

> Var1_categories
[1] "unknown" "yes" "no"

But I cannot figure out a process for how to bring this information together in order to automate the recoding process towards an expected result like

| ID | Var1    | Var2              |
|----|---------|-------------------|
| 1  | yes     | condition E       |
| 2  | yes     | condition H       |
| 3  | unknown | condition D       |
| 4  | no      | unknown condition |
| 5  | unknown | condition H       |

where each column is a meaningful factor variable.

As I said, the data frame is very wide and things might change internally, so doing this manually is not an option. I feel like I'm being stupid as I have all the necessary information readily available, so any insight would be greatly appreciated, and a cup of coffee is the least I can do for helpful advice.

// edit:

I forgot to mention that I have already made some kind of a mapping-dataframe but I couldn't really put it to use, yet. It looks like this:

mapping <- data.frame("Variable" = c("Var1", "Var2", "Var3", "Var4"),
                      "Value1" = c(2,2,2,7),
                      "Word1" = c("yes","yes","yes","condition A"),
                      "Value2" = c(1,1,1,6),
                      "Word2" = c("no","no","no","Condition B"),
                      "Value3" = c(8,8,8,5),
                      "Word3" = c("unk","unk","unk", "Condition C"),
                      "Value4" = c(NA,NA,NA,4),
                      "Word4" = c(NA,NA,NA,"Condition B")
                      )

I would like to "long"-transform it so I can use it with @r2evan 's solution.

Do you have var_categories stored in separate variables for each var? You should have given the var2_categories also in the question to clarify the things better. — AnilGoyal, Jun 19 '21 at 08:40

score 1 · Accepted Answer · answered Jun 18 '21 at 23:29

Here's one thought, though it requires reshaping (twice) the data.

mapping <- data.frame(
  Var = c(rep("Var1", 3), rep("Var2", 5)),
  Val = c(1, 2, 8, 4, 5, 8, 10, 88),
  Words = c("no", "yes", "unk", "D", "E", "H", "H", "unk")
)
mapping
#    Var Val Words
# 1 Var1   1    no
# 2 Var1   2   yes
# 3 Var1   8   unk
# 4 Var2   4     D
# 5 Var2   5     E
# 6 Var2   8     H
# 7 Var2  10     H
# 8 Var2  88   unk

library(dplyr)
library(tidyr) # pivot_*
data %>%
  pivot_longer(-ID, names_to = "Var", values_to = "Val") %>%
  left_join(mapping, by = c("Var", "Val")) %>%
  pivot_wider(ID, names_from = "Var", values_from = "Words")
# # A tibble: 5 x 3
#      ID Var1  Var2 
#   <int> <chr> <chr>
# 1     1 yes   E    
# 2     2 yes   H    
# 3     3 unk   D    
# 4     4 no    unk  
# 5     5 unk   H

With this method, you control the number-to-words mapping for each variable.

Another option is to use a map list, similar to above but it does not require double-reshaping.

maplist <- list(
  Var1 = c("1" = "no", "2" = "yes", "8" = "unk"),
  Var2 = c("4" = "D", "5" = "E", "8" = "H", "10" = "H", "88" = "unk")
)
maplist
# $Var1
#     1     2     8 
#  "no" "yes" "unk" 
# $Var2
#     4     5     8    10    88 
#   "D"   "E"   "H"   "H" "unk" 
nms <- c("Var1", "Var2")
data[,nms] <- Map(function(val, lookup) lookup[as.character(val)], 
                  data[nms], maplist[nms])
data
#   ID Var1 Var2
# 1  1  yes    E
# 2  2  yes    H
# 3  3  unk    D
# 4  4   no  unk
# 5  5  unk    H

Between the two, I think I prefer the first if your data doesn't punish you for reshaping it (many things could make this less appealing). One reason it's good is that maintaining the mapping can be as easy as maintaining a CSV (which might be done in your favorite spreadsheet tool, e.g., Excel or Calc).

This has helped me a lot! I can easily append the vectors comprising the "mapping"-data frame with the information available to me. Of course, my data frame has many variables that I don't have to recode (like string data and measurements) but I was able to adapt your solution by referring to these columns in the pivot_longer function. I for got to mention that I already have a wide mapping data-frame but I am having a hard time transforming it to look like your long mapping data-frame. Could you kindly have a look at my mapping data-frame? I put it into my original question. — Klaus Peter, Jun 19 '21 at 12:55
One reason I'm not fond of that style of storing the mapping is that some variables will have significantly more factors than others, leading to a lot of `NA`s for some variables. That's not hard to work around, certainly, so it might be more subjective than a fix is worth. FYI, you can convert from your `mapping` to my (first) version with something like `tidyr::pivot_longer(mapping, -Variable, names_pattern="([^0-9]+)([0-9]+)", names_to=c(".value", "ign")) %>% select(-ign) %>% filter(!is.na(Value))`. — r2evans, Jun 19 '21 at 17:57

Rory S · Answer 2 · 2021-06-19T12:46:15.863

Here's a way of doing it that requires no reshaping of your original data, and can be conceivably applied to any number of columns. First, put all your existing "values" and "categories" vectors into lists, formatting all as characters:

library(tidyverse)

# Recreating your existing vectors
Var1_values <- c(8, 2, 1)
Var2_values <- c(88, 10, 8, 5, 4)
Var1_categories <- c("unknown", "yes", "no")
Var2_categories <- c("unknown condition", "condition H", "condition H", "condition E", "condition D")

Var_values <- list(Var1_values, Var2_values) %>%
  map(as.character)

Var_categories <- list(Var1_categories, Var2_categories)

Add names to each element of each vector in Var_categories using Var_values, and get a list of variable names to recode from your dataset:

for (i in 1:length(Var_categories)) {
  names(Var_categories[[i]]) <- Var_values[[i]]
}

vars_names <- str_subset(colnames(data), "Var")

Then, use map2 to recode all of your target variables, before transforming into a tibble with ID column.

  data_recoded <- map2(vars_names, Var_categories, ~ dplyr::recode(unlist(data[.x], use.names = F), !!!.y)) %>%
    as_tibble(.name_repair = ~ vars_names) %>%
    add_column(ID = 1:5, .before = "Var1")

Output (data_recoded):

     ID Var1    Var2             
  <int> <chr>   <chr>            
1     1 yes     condition E      
2     2 yes     condition H      
3     3 unknown condition D      
4     4 no      unknown condition
5     5 unknown condition H

Thank you very much, I will look into this! However, as far as I can tell, I would still have to create a Var_recode-vector by hand which I would like to avoid as there are many variables with wildly different categories. As I said in my original post, I do have the information ready at hand, albeit in separate vectors. — Klaus Peter, Jun 19 '21 at 11:28
@KlausPeter No worries, I've edited my answer to use the existing vectors you have; hopefully this should prevent you from having to code more information by hand. — Rory S, Jun 19 '21 at 12:48

Automating the process of recoding numeric variables to meaningful factor variables

2 Answers2