
I am having trouble understanding the RCF algorithm, particularly what kind of data it expects and what pre-processing should be done. For example, I have the following data/features (with example values) for about 500K records:

[table of example values with header]

The results of my RCF model (trained on 500K records with 57 features: amount, 30 dummied countries, and 26 dummied categories) are extremely focused on the amount feature (e.g., all anomalies are above approx. 1000.00, with absolutely no variation based on country or type).

I also normalized the amount field, and those results are not really strong either. In fact, it's safe to say the results are terrible and I am clearly missing something.

Overall, I am looking for some guidance on getting the features right (again: 1 amount field and 2 categorical fields, dummied to 1s and 0s, resulting in about 57 fields). I'm wondering if I am better off with something like k-means.

EDIT: Some context here... I am wondering:
1) Weighting - Is there a way to give weight to certain variables (i.e., one of the categorical variables is more important than the other)? For example, I am using Country and Category as key attributes and want to give more weight to Category than to Country.
2) Context - How can I ensure outliers are considered in the context of their peers (the categorical data)? For example, a transaction of $5000 for an "airfare" expense is not an outlier for that category but would be for any other. I could create N models, one per category, but that would get messy and cumbersome, right?

I looked through most of the available documentation (https://docs.aws.amazon.com/sagemaker/latest/dg/rcf_how-it-works.html) and cannot find anything that describes this!

Thank you so much for your help in advance!

EDIT: I'm not sure it's critical at this point, when I don't even have semi-reasonable results, but I have used the following hyperparameters:
num_samples_per_tree=256,
num_trees=100

theStud54

1 Answer

I have never used Amazon RCF, but in general tree-based models do not perform particularly well with one-hot encoding (or dummy encoding). I would instead use a numeric encoding (assigning integers from 1 to len(category)) or a binary encoder (the same idea, but with binary digits). This should allow the trees to make more meaningful splits on those variables.
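To illustrate, here is a minimal sketch of replacing the dummy columns with a numeric encoding using scikit-learn's `OrdinalEncoder` (the sample values are hypothetical; your real data has ~30 countries and ~26 categories):

```python
import numpy as np
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical sample of the two categorical fields (country, category).
X_cat = np.array([
    ["US", "airfare"],
    ["DE", "meals"],
    ["US", "lodging"],
    ["FR", "airfare"],
])

# One integer column per field instead of ~56 sparse 0/1 dummy columns,
# giving the trees a compact feature to split on.
enc = OrdinalEncoder()
X_encoded = enc.fit_transform(X_cat)
print(X_encoded.shape)  # two columns, one per categorical field
```

You would then concatenate these two columns with the (normalized) amount column before training.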

In terms of hyperparameters it is hard to say: num_samples_per_tree depends on the ratio of outliers you expect to have, while num_trees affects the amount of data in each partition, and therefore the size of the individual trees, so it depends on the size of your dataset.

Try changing these things, and if you see no improvement you can try something different. Honestly, I would suggest DBSCAN over k-means, but to my knowledge they all require defining some distance measure between your points, which is not trivial since you are using a mix of categorical and numeric variables.
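As a rough sketch of the DBSCAN route (synthetic data here, not your real features): scale the columns so the amount does not dominate the Euclidean distance, then treat points assigned to no cluster (label -1) as outliers.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
# Hypothetical mix: one continuous amount plus one integer-encoded category.
amount = rng.normal(100, 20, size=(200, 1))
category = rng.integers(0, 5, size=(200, 1)).astype(float)
X = np.hstack([amount, category])
X[:3, 0] = 5000.0  # inject a few obvious amount outliers

# Scale both columns, then cluster; eps/min_samples need tuning on real data.
X_scaled = StandardScaler().fit_transform(X)
labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(X_scaled)
outliers = np.where(labels == -1)[0]  # noise points = candidate anomalies
```

Note the caveat above still applies: Euclidean distance on integer-encoded categories is a crude choice, and `eps` is sensitive to how you scale.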

EDIT:
1 - No, I don't think there's a way to weight features in RCF; as far as I know there usually isn't in any tree-based algorithm. However, if you use distance-based methods (hierarchical clustering, k-means, etc.) you can define your own distance metric that weights your features differently.
2 - Well, that's what the algorithm is for. It is supposed to find outliers based on the distribution of all features, not just one.
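The weighted-metric idea from point 1 can be sketched like this (the feature layout and weight values are hypothetical; such a callable can be passed as the metric to distance-based clusterers that accept one):

```python
import numpy as np

def weighted_euclidean(u, v, w):
    """Euclidean distance with per-feature weights w."""
    return np.sqrt(np.sum(w * (u - v) ** 2))

# e.g. [amount, country_code, category_code], with Category weighted 4x Country
w = np.array([1.0, 0.5, 2.0])
a = np.array([0.2, 1.0, 3.0])
b = np.array([0.2, 2.0, 3.0])  # differs from a only in Country
c = np.array([0.2, 1.0, 4.0])  # differs from a only in Category

# Same raw difference of 1.0, but the Category change counts for more:
d_country = weighted_euclidean(a, b, w)   # sqrt(0.5)
d_category = weighted_euclidean(a, c, w)  # sqrt(2.0)
```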

You can also try Isolation Forest if you want. It does not require any metric and it is easier to understand than RCF in my opinion.
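A minimal Isolation Forest sketch with scikit-learn (synthetic data; `contamination` is the fraction of points you expect to be anomalous, and a prediction of -1 marks an anomaly):

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(42)
# Hypothetical data: amount plus one integer-encoded category column.
X = np.column_stack([
    rng.normal(100, 20, 500),
    rng.integers(0, 26, 500).astype(float),
])
X[:5, 0] = 10_000.0  # a few clearly anomalous amounts

# fit_predict returns -1 for anomalies, 1 for inliers.
model = IsolationForest(n_estimators=100, contamination=0.02, random_state=0)
labels = model.fit_predict(X)
```

Like RCF, it isolates points with short random-partition paths, so no distance metric is needed.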

Davide ND
  • Davide ND - Thank you for this. I am realizing this now... DBSCAN looks promising, and maybe even OPTICS from sklearn. I am going to test out numeric encoding with RCF first. I also added some context to my question above. Would you mind checking it out? – theStud54 Dec 08 '19 at 14:44
  • I did perform the numeric labeling for the categories, but I can see that the model likely isn't making meaningful splits on the variable that has 25 values ("category"). I can imagine it's picking some value to split on for that field, but that, of course, doesn't make sense. Thus, it is still relying too heavily on the value field as the sole determinant of whether something is an outlier. I am at a loss for how to take this category variable into account. Thanks for any further thoughts! PS - I am working on the DBSCAN / OPTICS model too. – theStud54 Dec 09 '19 at 18:27
  • What is even more baffling is that when I bring RCF down to two variables, a continuous measure (unchanged) and a single encoded variable (0 if XXX, 1 if YYY), there is still no good differentiation. It literally only focuses on the continuous measure - i.e., anything over XXX is an outlier. Ugh - I must be missing something. – theStud54 Dec 09 '19 at 18:50
  • Well, that can easily be the case if the distribution of the continuous variable for both XXX and YYY is the same – Davide ND Dec 09 '19 at 19:24