My issue is this: when I create a task from a data frame with mlr3 and then read task$feature_names, the feature names come back in alphabetical order (which puts numbered names in the wrong numerical order), whereas I would like to keep the order in which the features appear in the data frame. I have supplied two examples below of what I mean. The first example uses numbered column names and the second uses alphabetic ones.
Example 1 (numerical):
library(mlr3)
# Set Values
n <- 10 # No of rows
p <- 10 # No of cols
e <- rnorm(n) # used for noise
b <- 10
# Create matrix of values
xValues <- matrix(rnorm(n*p), nrow=n) # Create n x p matrix of random values
colnames(xValues)<- paste0(1:p) # Name columns
df <- data.frame(xValues) # Create dataframe
# Equation
y <- (b + b*df$X1 - b*df$X2 + (b*df$X3)*(b*df$X2) + e) # Equation
# Adding y to df
df$y <- y
# mlr3 TASK
test_T = TaskRegr$new(id = "test", backend = df, target = "y")
test_T$feature_names
In the above example I create some data (columns X1 to X10, plus the target y) and then create an mlr3 task. However, when I run test_T$feature_names it returns this:
[1] "X1" "X10" "X2" "X3" "X4" "X5" "X6" "X7" "X8" "X9"
So, because of the leading 1 in X10, mlr3 places X10 before X2, as it would if the names were sorted as strings rather than by number.
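To illustrate what I think is happening, here is a minimal base-R sketch with no mlr3 involved. That mlr3 sorts the names as strings internally is my assumption; the sketch only shows that a plain character sort reproduces the order above, and that sorting on the numeric part recovers the order I want. The variable names are made up for the illustration.

```r
# Column names as they appear in the data frame
cols <- paste0("X", 1:10)

# A character sort compares strings character by character, so "X10"
# sorts before "X2" because "1" < "2" (method = "radix" uses the C locale)
sorted_cols <- sort(cols, method = "radix")
print(sorted_cols)

# Recover numeric order by ordering on the integer part of each name
natural_cols <- sorted_cols[order(as.integer(sub("^X", "", sorted_cols)))]
print(natural_cols)
```

This only works for names of the form X<number>, so it is not a general fix.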
Example 2 (alphabetical):
library(mlr3)
a <-rnorm(10)
b <-rnorm(10)
ab <-rnorm(10)
ba <-rnorm(10)
c <-rnorm(10)
myData <- data.frame(a, b, ab, ba, c)
t_T = TaskRegr$new(id = "test", backend = myData, target = "c")
t_T$feature_names
This time, the order of the variables in my data frame myData is a, b, ab, ba, c. However, when I run t_T$feature_names, it returns this:
[1] "a" "ab" "b" "ba"
It has changed the order to be alphabetical. I'm not sure if this is intentional or an oversight in mlr3, but is there any way to extract the feature names from an mlr3 task without the variable names being reordered?
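The only workaround I can think of is to filter the data frame's own column names by the task's feature names, which is a rough sketch rather than a proper solution. To keep it runnable without mlr3, the two vectors below are stand-ins: df_cols plays the role of names(myData) and feat plays the role of t_T$feature_names. Both names are hypothetical.

```r
# Stand-ins for the objects in Example 2 (no mlr3 needed to run this)
df_cols <- c("a", "b", "ab", "ba", "c")  # would be names(myData)
feat    <- c("a", "ab", "b", "ba")       # would be t_T$feature_names

# Subsetting df_cols keeps the data frame's column order and drops
# anything that is not a feature (here the target column "c")
feat_in_df_order <- df_cols[df_cols %in% feat]
print(feat_in_df_order)
```

This requires keeping the original data frame around, which is what I was hoping to avoid by relying on the task alone.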
I am still stuck on this issue. Does anyone have any suggestions?
EDIT: I am adding a (poor) graphical example, just to illustrate the issue. Continuing on from the numerical example, if I wanted to create a heat map style plot, but using $feature_names to obtain the feature names, I end up with something like this:
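library(dplyr)   # %>%, mutate, as_tibble
library(tidyr)   # pivot_longer
library(ggplot2)
# Feature names as returned by the task (X10 comes before X2)
nam <- test_T$feature_names
# Reshape to long format: one row per (observation, feature) pair
var_int2 = df %>% as_tibble %>%
  mutate(var_num1 = 1:length(nam)) %>%
  pivot_longer(cols = 1:length(nam),
               values_to = 'values') %>%
  mutate(var_num2 = rep(1:length(nam), length(nam)),
         alpha_imp = as.integer(var_num1 == var_num2),
         alpha_int = 1 - alpha_imp)
# Heat map with the axis labels taken from nam; diagonal cells are transparent
p <- ggplot(data = var_int2,
            mapping = aes(x = var_num1, y = var_num2)) +
  scale_x_continuous(breaks = 1:length(nam), labels = nam, position = "top") +
  scale_y_reverse(breaks = 1:length(nam), labels = nam) +
  geom_raster(aes(fill = y),
              alpha = var_int2$alpha_int)
p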
This will produce something like this:
As can be seen, it is plotting X10 in between X1 and X2. Ideally, I would like to keep the order of the features as they appear in the data frame. I know there might be other ways to reorder the plot; however, I was relying on $feature_names in a large plotting function I had created. Originally, I was using getTaskFeatureNames(task) from mlr, which keeps the feature names in the original order, but I recently updated to mlr3 and that seems to change the order.