I have a large data frame in R with over 200 mostly character variables that I would like to convert to factors. I have prepared all levels and labels in a separate data frame. For a given variable Var1, the corresponding levels and labels are named Var1_v and Var1_b; for example, for the variable Gender they are named Gender_v and Gender_b.

Here is an example of my data:

df <- data.frame (Gender = c("2","2","1","2"),
                  AgeG = c("3","1","4","2"))

fct <- data.frame (Gender_v  = c("1", "2"),
                  Gender_b = c("Male", "Female"),
                  AgeG_v = c("1","2","3","4"),
                  AgeG_b = c("<25","25-60","65-80",">80"))

df$Gender <- factor(df$Gender, levels = fct$Gender_v, labels = fct$Gender_b, exclude = NULL)
df$AgeG <- factor(df$AgeG, levels = fct$AgeG_v, labels = fct$AgeG_b, exclude = NULL)

Is there a way to automate the process, so that the factors (levels and labels) are applied to the corresponding variables without my having to do every single one individually? I think it could be done with a function, probably using pmap.

My goal is to minimize the effort needed for this process. Is there also a better way to prepare the labels and levels?

Help is much appreciated.

ebay
  • There is the option `stringsAsFactors` in the creation of data frames. This may be useful earlier in your data pipeline. The error in your example code is due to your Gender_v and AgeG_v being stored as character values instead of numerical values. Your current code works when `Gender_v = c(1,2)` i.e. no quotation marks. – typewriter Jan 20 '22 at 21:49
  • @typewriter How exactly should `stringsAsFactors` help? I am not getting any error in my code, by the way. It is just inefficient when you have to run it on over 200 variables. – ebay Jan 20 '22 at 21:57

2 Answers


I solved it with a simple refactoring of your code, automating it through a loop. The more variables you have, the more time this saves. I believe the lookup fct[[paste0(names(df[i]),"_v")]] could be refactored into a small function to look even better.

> df <- data.frame (Gender = c("2","2","1","2"),
+                   AgeG = c("3","1","4","2"))
> 
> fct <- data.frame (Gender_v  = c("1", "2"),
+                    Gender_b = c("Male", "Female"),
+                    AgeG_v = c("1","2","3","4"),
+                    AgeG_b = c("<25","25-60","65-80",">80"))
> 
> for(i in 1:ncol(df)){
+   
+   le <- fct[[paste0(names(df[i]),"_v")]]
+   
+   la <- fct[[paste0(names(df[i]),"_b")]]
+   
+   df[,i] <- factor(df[,i],levels = le ,labels = la,exclude = NULL)
+   
+ }
> 
> df
  Gender  AgeG
1 Female 65-80
2 Female   <25
3   Male   >80
4 Female 25-60
>
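The lookup mentioned above could be pulled into a small helper. This is a minimal sketch, assuming the same `_v`/`_b` naming as in the question; `lookup()` is a hypothetical name, not part of any package:

```r
df <- data.frame(Gender = c("2", "2", "1", "2"),
                 AgeG = c("3", "1", "4", "2"),
                 stringsAsFactors = FALSE)

fct <- data.frame(Gender_v = c("1", "2"),
                  Gender_b = c("Male", "Female"),
                  AgeG_v = c("1", "2", "3", "4"),
                  AgeG_b = c("<25", "25-60", "65-80", ">80"),
                  stringsAsFactors = FALSE)

# Hypothetical helper: fetch the column of `fct` named <var><suffix>
lookup <- function(var, suffix) fct[[paste0(var, suffix)]]

# Loop over column names instead of positions
for (var in names(df)) {
  df[[var]] <- factor(df[[var]],
                      levels = lookup(var, "_v"),
                      labels = lookup(var, "_b"),
                      exclude = NULL)
}
```

Looping over names rather than indices keeps the body free of repeated `names(df[i])` calls; the result should match the output shown above.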

Edit: Here is the if condition added


> library(stringr)  # provides str_remove()
> 
> df <- data.frame (Gender_f = c("2","2","1","2"),
+                   AgeG_f = c("3","1","4","2"),
+                   AgeN = c(70,15,96,30))
> 
> fct <- data.frame (Gender_v  = c("1", "2"),
+                    Gender_b = c("Male", "Female"),
+                    AgeG_v = c("1","2","3","4"),
+                    AgeG_b = c("<25","25-60","65-80",">80"))
> 
> for(i in 1:ncol(df)){
+ 
+   # only convert columns whose name ends in "_f"
+   if(endsWith(names(df[i]),"_f")){
+     
+     # strip the trailing "_f" to get the base name used in fct
+     name <- str_remove(names(df[i]),"_f$")
+   
+     le <- fct[[paste0(name,"_v")]]
+    
+     la <- fct[[paste0(name,"_b")]]
+      
+     df[,i] <- factor(df[,i],levels = le ,labels = la,exclude = NULL)
+   
+   }
+      
+ }
> 
> df
  Gender_f AgeG_f AgeN
1   Female  65-80   70
2   Female    <25   15
3     Male    >80   96
4   Female  25-60   30
> 
AugtPelle
  • Thanks. Your method works flawlessly with the example, but not with my data. The problem in your code is probably the assumption that there is a factor, a level and a label for each variable. This is not the case, and it turns the values of the other variables into missings. – ebay Jan 20 '22 at 22:57
  • Yes, you are correct ! Just asking, if the variable is a factor, you will always have 2 entries in the fct data frame ? Because in that case, with an if condition it's solved. – AugtPelle Jan 20 '22 at 23:02
  • Yes, I guess. There should be 2 entries in the fct data frame for each variable. I was thinking about a method that looks at the name of a variable in the df data frame, then adds "_v" and "_b" to create the factor for this variable from the fct data frame. How would you add the if condition, by the way? I am more experienced in SAS than R. – ebay Jan 20 '22 at 23:17
  • I was wondering, what would also differ if I added a label for an NA level? – ebay Jan 20 '22 at 23:24
  • I edited it with the if condition added ! – AugtPelle Jan 20 '22 at 23:33
  • Based on your response, my recommendation would be to just create or read the variables. After that, I would transform the variables matching a specific pattern, such as ending with "_f", to factors using as.factor(). That is all ! Really simple. – AugtPelle Jan 20 '22 at 23:36
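
On the question in the comments about a label for an NA level, here is a small sketch with made-up data (the vector `x` and the label "Missing" are assumptions for illustration). With `exclude = NULL`, as used in the code above, `NA` can be listed among the levels and given its own label:

```r
x <- c("1", NA, "2")

# exclude = NULL keeps NA as a level, so it can receive its own label
f <- factor(x,
            levels = c("1", "2", NA),
            labels = c("Male", "Female", "Missing"),
            exclude = NULL)

as.character(f)
#> [1] "Male"    "Missing" "Female"
```

With the default `exclude = NA`, the NA level would be dropped from `levels`, so the three labels would no longer line up with the two remaining levels.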

A data frame is not really an appropriate data structure for storing the factor level definitions in: there’s no reason to expect all factors to have an equal number of levels. Rather, I’d just use a plain list, and store the level information more compactly as named vectors, along these lines:

df <- data.frame(
  Gender = c("2", "2", "1", "2"),
  AgeG = c("3", "1", "4", "2")
)

value_labels <- list(
  Gender = c("Male" = 1, "Female" = 2),
  AgeG = c("<25" = 1, "25-60" = 2, "65-80" = 3, ">80" = 4)
)

Then you can make a function that uses that data structure to make factors in a data frame:

make_factors <- function(data, value_labels) {
  for (var in names(value_labels)) {
    if (var %in% colnames(data)) {
      vl <- value_labels[[var]]
      data[[var]] <- factor(
        data[[var]],
        levels = unname(vl),
        labels = names(vl)
      )
    }
  }
  data
}

make_factors(df, value_labels)
#>   Gender  AgeG
#> 1 Female 65-80
#> 2 Female   <25
#> 3   Male   >80
#> 4 Female 25-60
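
If the definitions already exist as a data frame like the `fct` in the question, they could be converted to this list structure rather than retyped. This is a rough sketch, assuming every variable has a `_v` and a `_b` column; `vars` and `pairs` are names introduced here for illustration:

```r
fct <- data.frame(Gender_v = c("1", "2"),
                  Gender_b = c("Male", "Female"),
                  AgeG_v = c("1", "2", "3", "4"),
                  AgeG_b = c("<25", "25-60", "65-80", ">80"),
                  stringsAsFactors = FALSE)

# Variable names are the "_v" column names with the suffix removed
vars <- sub("_v$", "", grep("_v$", names(fct), value = TRUE))

value_labels <- lapply(setNames(vars, vars), function(v) {
  # data.frame() recycled the shorter columns, so drop repeated pairs
  pairs <- unique(fct[, paste0(v, c("_v", "_b"))])
  # named vector: names are the labels, values are the levels
  setNames(pairs[[1]], pairs[[2]])
})
```

The levels come out as character strings rather than numbers, which matches the character columns in `df`, so the result should be usable with `make_factors()` as written.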
Mikko Marttila
  • Thanks Mikko. I have changed one thing in your code to make it easier. I have switched the positions of the levels and labels to make it for example `1 = 'Male'` instead of `'Male' = 1`, and changed the function accordingly `levels = names(vl), labels = unname(vl)`. – ebay Jan 21 '22 at 15:03