4

I am using the partykit package and came across the following error message:

Error in matrix(0, nrow = mi, ncol = nl) : 
invalid 'nrow' value (too large or NA)
In addition: Warning message:
In matrix(0, nrow = mi, ncol = nl) :
NAs introduced by coercion to integer range

I used the example given in this article, which compares packages and how they handle splitting variables with many categories.

The problem is that the splitting variable has too many categories. Within the mob() function a matrix of all possible splits is created. This matrix alone has size p * (2^(p-1) - 1), where p is the number of categories of the splitting variable. Depending on the available system resources (RAM etc.), the error occurs at different values of p.
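For illustration (my own numbers, not part of the original post), the combinatorial growth is easy to see:

## Number of candidate binary partitions of a factor with p levels.
p <- c(5, 10, 20, 30, 33, 40)
data.frame(levels = p, partitions = 2^(p - 1) - 1)
## Beyond roughly p = 32, 2^(p-1) - 1 exceeds .Machine$integer.max (~2.1e9),
## so a dimension of that size is coerced to NA and matrix() fails as above.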

The article suggests using the Gini criterion. However, given the intention of the partykit package, I think the Gini criterion cannot be used, because I do not have a classification problem with a target variable but a model specification problem.

My question therefore: is there a way to find the split in such cases, or a way to reduce the number of splits that have to be checked?

Jakob Gepp

2 Answers

2

This trick of searching just k - 1 ordered splits rather than 2^(k-1) - 1 unordered partitions only works under certain circumstances, e.g., when it is possible to order the categories by the average value of the response within each category. I have never looked at the underlying theory in close enough detail, but this only works under certain assumptions and I'm not sure whether these are spelled out nicely enough somewhere. You certainly need a univariate problem in the sense that only one underlying parameter (typically the mean) is optimized. Possibly, continuous differentiability of the objective function might also be an issue, given the emphasis on Gini.
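As a rough sketch of that ordering trick (my own illustration; dat, y, and big_factor are made-up names for a data frame, a numeric response, and a factor with many levels):

## For a univariate, mean-based problem, order the k levels by the mean
## response and consider only the k - 1 ordered cutpoints instead of all
## 2^(k-1) - 1 binary partitions.
means  <- tapply(dat$y, dat$big_factor, mean)
ord    <- names(sort(means))                 # levels ordered by mean response
k      <- length(ord)
splits <- lapply(seq_len(k - 1), function(j) ord[seq_len(j)])
## each element of 'splits' defines one candidate partition: these levels
## go left, all remaining levels go right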

As mob() is probably most frequently applied in situations where you partition more than a single parameter, I don't think it is possible to exploit this trick there. Similarly, ctree() can easily be applied in situations with multivariate scores, even if the response variable is univariate (e.g., for capturing location and scale differences).

I would usually recommend breaking down the factor with many levels into smaller pieces. For example, if you have a factor for the ZIP code of an observation, one could instead use a factor for state/province, a numeric variable coding the "size" (area or population), a factor coding rural vs. urban, etc. Of course, this is additional work, but it typically also leads to more interpretable results.
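A minimal sketch of such a breakdown (zip_info is a hypothetical lookup table with one row per ZIP code; all names are placeholders):

## Replace the high-cardinality ZIP factor by coarser, more interpretable
## covariates taken from a lookup table.
dat <- merge(dat, zip_info, by = "zip")      # adds state, population, urban
dat$state   <- factor(dat$state)             # e.g., ~50 levels instead of thousands
dat$urban   <- factor(dat$urban)             # rural vs. urban indicator
dat$log_pop <- log(dat$population)           # numeric "size" of the area
## These derived variables can then be used as partitioning variables in
## mob() or ctree() instead of the raw ZIP code.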

Having said that, it is on our wish list for partykit to exploit such tricks where they are available. But it is not at the top of our current agenda...

Achim Zeileis
    I agree with that logic. Having said that - and because it may not always be possible to get "smaller pieces" - I am trying to add a cluster analysis (kmeans) for the split variables which have more than a specific number of categories (e.g. 10). The clusters are build on the parameters for each single category. I hope to find maybe not the best split, but at least a good one, so that i can use large number of factors. I know, that i may run into the problem, that a single category is smaller than the minsize. Do you have any thoughts about this? – Jakob Gepp Jul 05 '17 at 13:28
  • I once had the idea to run k-means (or something similar) on the score matrix. Thus, instead of doing the score-based parameter instability tests _along_ a covariate, run a score-based k-means without using any covariate. In the _very limited_ scenarios I looked at, this wasn't very promising, though. In the situation with a categorical covariate with many levels, one could try the following: aggregate the scores within each category and then run 2-means (or k-means for k = 1, 2, 3, ...). It's worth a try, I think (a rough sketch of this idea follows below). – Achim Zeileis Jul 05 '17 at 20:21
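A rough sketch of the aggregation idea from the last comment (illustrative only, not code from partykit; the model, data, and variable names are placeholders):

library(sandwich)                            # provides estfun()

fit <- lm(y ~ x, data = dat)                 # assumed global model
sc  <- estfun(fit)                           # n x q matrix of score contributions
agg <- aggregate(as.data.frame(sc),
                 by = list(level = dat$big_factor), FUN = mean)

km <- kmeans(agg[, -1], centers = 2)         # 2-means on the aggregated scores
split(agg$level, km$cluster)                 # candidate grouping of the factor levels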
0

The way I solved the problem was by transforming the variable into a contrast (dummy) matrix, using model.matrix(~ 0 + predictor, data). ctree() cannot manage factors with very many levels, but it can easily manage datasets with many variables.
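A hedged sketch of that transformation (dat, y, and predictor are placeholder names; factor levels are assumed to give syntactically valid column names):

library(partykit)

X    <- model.matrix(~ 0 + predictor, data = dat)    # one 0/1 column per level
dat2 <- cbind(dat[setdiff(names(dat), "predictor")], as.data.frame(X))
tr   <- ctree(y ~ ., data = dat2)                     # each level is its own split variable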

Of course, there are drawbacks: with this technique you lose the level-grouping feature of ctree(); each node will split on just one level, since the levels are now separate columns.

Bakaburg
  • This is an interesting idea, but when you have a lot of categories, they might not fulfill the minsize requirement, for example. Do you have a solution / idea for this? – Jakob Gepp Dec 17 '19 at 07:18
  • Uhm, I'm not sure what you mean. If you mean that very rare levels will be ignored, that is indeed true. In my experience, one-hotting the categorical predictors before running the tree increases specificity at the cost of sensitivity. For example, if you have a categorical predictor with many levels, one-hotting will select only those levels which are strikingly predictive of the outcome and will ignore the others; with the untransformed predictor, by contrast, you will have groups of levels in each rule, but you may never get down to a specific category unless your tree is very deep. – Bakaburg Dec 18 '19 at 17:05
  • So it's a tradeoff. One-hotting has two more good effects: 1) it is less variable, i.e. the rules are less impacted by random variations in the data (a big problem with decision trees in general); 2) the algorithm is way faster at selecting variables than at grouping levels (orders of magnitude faster). – Bakaburg Dec 18 '19 at 17:08