I'm trying to understand whether it is possible to apply iForest directly to an extremely large static dataset (fixed in both cardinality and dimensionality) without using distributed processing frameworks like Hadoop or Spark. And is such a dataset even considered Big Data?
By "directly" I mean without loading the whole dataset into RAM. As you know, iForest builds its iTrees from subsamples, and I don't know exactly whether disk I/O speed has any effect on the performance of the algorithm.
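To make the "no need to load everything into RAM" idea concrete, here is a minimal sketch of how one iTree subsample could be drawn in a single sequential pass using reservoir sampling. This is only an illustration under my own assumptions, not how any particular iForest implementation reads its data: the names `reservoir_sample` and `read_rows` and the in-memory stand-in file are made up for the example, and 256 is the subsample size suggested in the original iForest paper.

```python
import csv
import io
import random


def reservoir_sample(rows, k, seed=0):
    """Draw k rows uniformly at random from an iterable in one pass (Algorithm R)."""
    rng = random.Random(seed)
    sample = []
    for i, row in enumerate(rows):
        if i < k:
            # Fill the reservoir with the first k rows.
            sample.append(row)
        else:
            # Keep the new row with probability k / (i + 1).
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                sample[j] = row
    return sample


def read_rows(handle):
    """Yield one numeric row at a time, so only the current row is held in memory."""
    for record in csv.reader(handle):
        yield [float(v) for v in record]


# Stand-in for a large CSV file on disk (hypothetical data): 10,000 rows by 3 columns.
fake_file = io.StringIO("\n".join(f"{i},{i * 2},{i * 3}" for i in range(10_000)))
subsample = reservoir_sample(read_rows(fake_file), k=256)
```

Only the reservoir of 256 rows stays in memory, but the scan still touches every row once, so disk read speed would bound the sampling step in this sketch. A forest needs one subsample per tree; several reservoirs could in principle be filled during the same pass.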
For my M.S. thesis I've developed a new method for local outlier detection in Big Data. It is based on an old scalable clustering algorithm named BFR, with one difference in the structure of the Gaussian clusters: they are allowed to be correlated. Like BFR, it doesn't need to load the whole dataset into RAM and instead scans it chunk by chunk. It first takes a random sample of the whole data to obtain the initial clustering information, then applies scalable clustering to complete the clustering model, and finally, in another scan of the entire dataset, assigns each object an outlier score named SDCOR (Scalable Density-based Clustering Outlierness Ratio).
The problem is that the data I've used is static, not streaming, and even the largest synthetic dataset is only about 1 million objects by 40 dimensions, with a volume of less than 400 megabytes. However, I've shown both theoretically and empirically that the method is scalable and that its time complexity is linear with a low constant: on the mentioned 1e6-by-40 dataset it finishes processing with an AUC of 100% in about 4 minutes, and I'm sure this could be reduced further by improving the implementation. I've implemented the whole method in MATLAB 9 and even made a lovely GUI. I'm currently writing a paper based on the thesis, but I'm worried about the referees' feedback on the gist of the paper, which makes claims about Big Data.
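To illustrate what I mean by scanning chunk by chunk, here is a minimal sketch of BFR-style sufficient statistics accumulated one chunk at a time. It is not my SDCOR implementation (which is in MATLAB). BFR itself keeps a count, a per-dimension sum and a per-dimension sum of squares; keeping the sum of outer products instead, so that a full covariance matrix can be recovered for correlated clusters, is an assumption made for this example, as are the names `update_stats` and `mean_and_cov` and the toy data.

```python
def update_stats(stats, chunk):
    """Fold one chunk of rows into the running (count, sum, sum of outer products)."""
    n, s, ss = stats
    for x in chunk:
        n += 1
        for a in range(len(x)):
            s[a] += x[a]
            for b in range(len(x)):
                ss[a][b] += x[a] * x[b]
    return n, s, ss


def mean_and_cov(stats):
    """Recover the mean vector and population covariance from the running sums."""
    n, s, ss = stats
    d = len(s)
    mean = [v / n for v in s]
    cov = [[ss[a][b] / n - mean[a] * mean[b] for b in range(d)] for a in range(d)]
    return mean, cov


d = 2
stats = (0, [0.0] * d, [[0.0] * d for _ in range(d)])

# Two toy chunks standing in for blocks read from disk (hypothetical data).
chunks = [[[1.0, 2.0], [2.0, 4.0]], [[3.0, 6.0], [4.0, 8.0]]]
for chunk in chunks:
    stats = update_stats(stats, chunk)

mean, cov = mean_and_cov(stats)
```

Memory use depends only on the dimensionality and the chunk size, not on the number of objects. The naive sum-of-products formula can lose precision on badly scaled data, so a numerically stable update would be preferable in practice.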
Here is the table of final results of my method (SDCOR) and other competing methods on real-life and synthetic datasets:
Note: The bold-faced values are the best among all methods.
Here is a screenshot of my GUI:
Any useful comments will be welcome! ;-) Thank you ...