2

My issue is that when I create a task from a data frame with mlr3, the task's $feature_names field returns the variables in alphabetical order (or, for numbered names, in lexicographic rather than numerical order), whereas I would like to keep the order in which the feature names appear in the data frame. I have supplied two examples below of what I mean. The first uses numbered column names and the second is alphabetical.

Example 1 (numerical):

library(mlr3)
# Set Values
n <- 10      # No of rows
p <- 10       # No of cols
e <- rnorm(n) # used for noise
b <- 10      


# Create matrix of values
xValues <- matrix(rnorm(n*p), nrow=n)   # Create matrix with p (= 10) columns
colnames(xValues)<- paste0(1:p)     # Name columns
df <- data.frame(xValues)               # Create dataframe

# Equation 
y <- (b + b*df$X1 - b*df$X2 + (b*df$X3)*(b*df$X2) + e)     # Equation

# Adding y to df
df$y <- y

# mlr3 TASK
test_T = TaskRegr$new(id = "test", backend = df, target = "y")
test_T$feature_names

So in the above example I create some data (i.e., X1 to X10) and then create an mlr3 task. However, when I run test_T$feature_names it returns this:

[1] "X1"  "X10" "X2"  "X3"  "X4"  "X5"  "X6"  "X7"  "X8"  "X9" 

So, because of the leading 1 in X10, mlr3 thinks X10 should appear before X2.
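The order shown above matches what plain character sorting gives. As a minimal sketch in base R only (no mlr3 involved; `nms` is just a stand-in for the column names of Example 1), the following reproduces that order and then recovers the numerical order by sorting on the integer suffix. It illustrates the sorting behaviour and makes no claim about how mlr3 orders names internally.

```r
# Stand-in for the feature names in Example 1 (hypothetical, no mlr3 needed)
nms <- paste0("X", 1:10)

# Character sorting compares strings character by character,
# so "X10" lands before "X2". method = "radix" sorts in the C locale,
# which makes the result independent of the system locale.
sorted_nms <- sort(nms, method = "radix")
print(sorted_nms)

# Recover numerical order by ordering on the integer suffix instead
num_suffix <- as.integer(sub("^X", "", sorted_nms))
natural_nms <- sorted_nms[order(num_suffix)]
print(natural_nms)
```

This suffix trick only works when every name follows the same prefix-plus-number pattern, as in this example.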

Example 2 (alphabetical):

library(mlr3)
a  <-rnorm(10)
b  <-rnorm(10)
ab <-rnorm(10)
ba <-rnorm(10)
c  <-rnorm(10)
myData <- data.frame(a, b, ab, ba, c)
t_T = TaskRegr$new(id = "test", backend = myData, target = "c")
t_T$feature_names

This time, the order of the variables in my data frame is given by myData (i.e., a, b, ab, ba, c). However, when I run t_T$feature_names, it returns this:

[1] "a"  "ab" "b"  "ba"

It has changed the order to be alphabetical. I'm not sure if this is intentional or an oversight in mlr3, but is there any way to extract the feature names from an mlr3 task without the variable names being re-ordered?
I am still stuck on this issue, so any suggestions would be welcome.

EDIT: I am adding a (poor) graphical example, just to illustrate the issue. So, continuing on from the numerical example, if I wanted to create a heat map style plot, but using $feature_names to obtain the feature names, I end up with something like this:

library(dplyr)    # %>%, mutate, as_tibble
library(tidyr)    # pivot_longer
library(ggplot2)  # plotting

nam <- test_T$feature_names

var_int2 = df %>% as_tibble %>% 
  mutate(var_num1 = 1:length(nam)) %>% 
  pivot_longer(cols = 1:length(nam),
               values_to = 'values') %>% 
  mutate(var_num2 = rep(1:length(nam), length(nam)),
         alpha_imp = as.integer(var_num1 == var_num2),
         alpha_int = 1 - alpha_imp)

p <- ggplot(data = var_int2, 
            mapping = aes(x = var_num1, y = var_num2)) + 
  scale_x_continuous(breaks = 1:length(nam), labels = nam, position = "top") + 
  scale_y_reverse(breaks = 1:length(nam), labels = nam) +
  geom_raster(aes(fill = y),
              alpha = var_int2$alpha_int)

p

This will produce something like this (heat map image not shown).

As can be seen, it is plotting X10 in between X1 and X2. Ideally, I would like to keep the order of the features as they appear in the data frame. I know there might be other ways to reorder the plot, however, I was relying on $feature_names in a large plotting function I had created. Originally, I was using getTaskFeatureNames(task) from mlr, which keeps the feature names in the original order... but I recently updated to mlr3 and that seems to change the order.

Electrino
  • Could you sum up the final outcome or mark one of the answers as accepted? This would help future readers. Thanks. – pat-s Nov 06 '20 at 21:47

2 Answers

0

If you can provide an example or use case where the order of the features is important, we can try to keep it.

Michel
-1

We had a short discussion and don't consider this a bug. You can also look at the data in the task and obtain the column names:

task = tsk("mtcars")
task$feature_names
# [1] "am"   "carb" "cyl"  "disp" "drat" "gear" "hp"   "qsec" "vs"   "wt"  
colnames(task$data())
# [1] "mpg"  "am"   "carb" "cyl"  "disp" "drat" "gear" "hp"   "qsec" "vs"   "wt" 

Note that this contains the target column. Also, it can get slow if you are using a backend other than data.table, because the data has to be retrieved, whereas $feature_names is independent of the data.

To sum up, you could use this solution if the order is important:

setdiff(colnames(task$data()), task$target_names)
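In the mtcars output above, the columns of task$data() are themselves in sorted order, so the line above may not restore the order of the original data frame. A possible workaround, sketched here under the assumption that the original data frame is still available, is to use its column names as the reference. The vectors below are stand-ins copied from Example 1 of the question so that the snippet runs without mlr3.

```r
# Original column order of the data frame used as backend in Example 1
original_cols <- c(paste0("X", 1:10), "y")

# Stand-in for test_T$feature_names, copied from the output in the question
feature_names <- c("X1", "X10", "X2", "X3", "X4", "X5", "X6", "X7", "X8", "X9")

# intersect() keeps the order of its first argument, so the features come
# back in data-frame order; the target "y" is dropped because it is not
# among the feature names
ordered_features <- intersect(original_cols, feature_names)
print(ordered_features)
```

With the real objects from the question this would read intersect(colnames(df), test_T$feature_names).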
jakob-r
  • Thanks for your response. The long and short of it is that I was using `$feature_names` inside a function in my package to obtain the feature names and then I was using this to create some custom plots I had designed. It is not so evident in the alphabetical example, but in the numerical example, it would make sense for my plots to be in numerical order (ie X1,X2,...,Xn). At the moment, it is plotting X10 etc beside X1. In relation to your solution, I'm getting no difference between `$feature_names` and `setdiff(colnames(task$data()), task$target_names)`. Both have the same output – Electrino Jul 02 '20 at 19:46