
I'm trying to understand whether it is possible to apply iForest directly to an extremely large dataset that is static (fixed size, in both cardinality and dimensionality), without using distributed processing frameworks like Hadoop or Spark. And is such a dataset even considered Big Data?

By "directly" I mean that there is no need to load the whole dataset into RAM, because, as you know, iForest uses subsampling to build its iTrees, and I don't know exactly whether the disk I/O speed has any effect on the performance of the algorithm.
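To make the subsampling point concrete, here is a minimal sketch of what I have in mind (in Python with NumPy and scikit-learn, which is just my choice for illustration; my own implementation is in MATLAB). The file name, shape, and chunk size are hypothetical: the idea is that iForest only needs a small subsample per tree, so the subsample can be drawn from a memory-mapped file and the full dataset never has to sit in RAM.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Hypothetical on-disk dataset: 1e6 rows, 40 features, stored as float64.
N, D = 1_000_000, 40
X = np.memmap("data.bin", dtype=np.float64, mode="r", shape=(N, D))

# iForest only needs a small subsample per tree (psi = 256 by default),
# so we draw n_estimators * psi rows at random and load just those into RAM.
rng = np.random.default_rng(0)
n_estimators, psi = 100, 256
sample_idx = rng.choice(N, size=n_estimators * psi, replace=False)
X_sample = np.asarray(X[np.sort(sample_idx)])   # only this slice is read into RAM

model = IsolationForest(n_estimators=n_estimators, max_samples=psi, random_state=0)
model.fit(X_sample)

# Score the full dataset chunk by chunk, so memory usage stays bounded.
chunk = 50_000
scores = np.empty(N)
for start in range(0, N, chunk):
    stop = min(start + chunk, N)
    scores[start:stop] = -model.score_samples(np.asarray(X[start:stop]))
```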

Actually, for my M.S. thesis I've developed a new method for local outlier detection in Big Data. It is based on an old scalable clustering algorithm named BFR, with a slight difference in the structure of the Gaussian clusters: they are allowed to be correlated. Like BFR, it doesn't need to load the whole dataset into RAM; it scans the data chunk by chunk. It first takes a random sample of the whole data to obtain initial clustering information, then applies a scalable clustering pass to complete the clustering model, and finally, in another scan of the entire dataset, assigns each object an outlier score named SDCOR (Scalable Density-based Clustering Outlierness Ratio); a rough sketch of this chunk-by-chunk pattern is given below.

However, the data I've used is static, not streaming, and even the largest synthetic dataset is about 1 million points by 40 dimensions, with a volume of less than 400 megabytes. I've shown, both theoretically and empirically, that the method is scalable and that its time complexity is linear with a low constant: on the 1e6-by-40 dataset mentioned above it finishes processing with 100% AUC in about 4 minutes, and I'm sure this could be reduced further by improving the implementation. I've implemented the whole method in MATLAB 9, including a GUI, and I'm currently writing a paper based on the thesis, but I'm worried about the referees' feedback on the gist of the paper, which makes claims about Big Data.
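For readers unfamiliar with the BFR-style pattern, here is a schematic sketch in Python/NumPy of the general idea: scan the data chunk by chunk while maintaining cluster sufficient statistics (count, linear sum, sum of outer products), then score points by their Mahalanobis distance to the nearest cluster. This is not my SDCOR implementation (which is in MATLAB and uses a different score); all names and the scoring rule here are placeholders for illustration only.

```python
import numpy as np

def chunks(X, size):
    """Yield consecutive row blocks of a (possibly memory-mapped) array."""
    for start in range(0, X.shape[0], size):
        yield np.asarray(X[start:start + size])

class ClusterStats:
    """BFR-style sufficient statistics of one Gaussian cluster."""
    def __init__(self, points):
        self.n = points.shape[0]
        self.ls = points.sum(axis=0)      # linear sum
        self.ss = points.T @ points       # sum of outer products

    def update(self, points):
        """Absorb a new batch of points without storing them."""
        self.n += points.shape[0]
        self.ls += points.sum(axis=0)
        self.ss += points.T @ points

    @property
    def mean(self):
        return self.ls / self.n

    @property
    def cov(self):
        m = self.mean
        return self.ss / self.n - np.outer(m, m)

def mahalanobis2(points, stats):
    """Squared Mahalanobis distance of each point to one cluster."""
    diff = points - stats.mean
    inv = np.linalg.pinv(stats.cov)
    return np.einsum("ij,jk,ik->i", diff, inv, diff)

def score_chunked(X, clusters, chunk_size=50_000):
    """Outlierness score = squared Mahalanobis distance to the closest cluster
    (a stand-in for the real SDCOR score, which is not reproduced here)."""
    scores = np.empty(X.shape[0])
    for i, block in enumerate(chunks(X, chunk_size)):
        d = np.stack([mahalanobis2(block, c) for c in clusters], axis=1)
        scores[i * chunk_size : i * chunk_size + block.shape[0]] = d.min(axis=1)
    return scores
```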

Here is the table of final results for my method (SDCOR) and the competing methods on real-life and synthetic datasets:

[Table: Final results of SDCOR and competing methods. Note: the bold-faced values are the best among all methods.]

Here is a screenshot of my GUI:

[Screenshot: SDCOR on a synthetic dataset]

Any useful comments will be welcome! ;-) Thank you ...

SANN

2 Answers


iForest is an extremely simple (and thus fast) density estimation technique based on subsampling. It's not a particularly clever technique; what it mostly shows is one of the evaluation deficits in outlier detection, namely that many data sets are best solved by simple density estimation. (For this reason you should always include the kNN outlier detector with k = 1, 2, 5, 10 as a baseline, because on incredibly many data sets this trivial approach performs very well; a minimal version is sketched below.) And of course on such density data, iForest will shine, being a very fast approximate density estimator. The results of iForest will usually correlate with kNN, but it scales better because of the approximation.
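As a minimal sketch of that kNN-outlier baseline (Python/scikit-learn is my own choice here for illustration; the answerer would likely point to ELKI), each point is scored by the distance to its k-th nearest neighbor:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def knn_outlier_scores(X, k=5):
    """Score each point by the distance to its k-th nearest neighbor
    (larger distance = more outlying)."""
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)  # +1: each point is its own 0-th neighbor
    dist, _ = nn.kneighbors(X)
    return dist[:, k]

# Trying the baseline values suggested above on some toy data:
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 40))
for k in (1, 2, 5, 10):
    print(k, knn_outlier_scores(X, k).max())
```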

Data sets like the ones in your screenshots are largely meaningless. Whenever you have a data set on which some method reaches 100%, it just means the data is far too idealized, and this is overfitting.

Yes, you can easily scale it to absurdly large data, because it only uses a sample anyway.

For coordinate data, never use Spark etc. After preprocessing, such data is almost never big enough to warrant the overhead of Spark. Just do the math: how many data points fit into main memory? You cannot have "big data" with low-dimensional point data; for that you need text, graphs, photos, or better, videos (a back-of-the-envelope calculation is sketched below). It's almost always more efficient, and possible with today's memory, to use an in-memory indexed solution such as ELKI rather than Matlab, because as far as I know, Matlab doesn't have such indexes. If you want to become even faster, use approximate nearest neighbors such as libANN or FLANN (but then you'll need to write the code yourself; I usually use ELKI because it has almost every important method ready to try).
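To illustrate the "do the math" point with the dataset size from the question (1e6 rows by 40 dimensions, stored as double precision; this is plain arithmetic, not anyone's benchmark):

```python
n_rows, n_dims, bytes_per_value = 1_000_000, 40, 8  # float64
footprint_gib = n_rows * n_dims * bytes_per_value / 1024**3
print(f"{footprint_gib:.2f} GiB")  # about 0.30 GiB, which fits into RAM with plenty of room
```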

Has QUIT--Anony-Mousse
  • Thanks for your nice answer @Anony-Mousse. But I still don't understand: is iForest capable of finding outliers without loading the whole data into RAM? I mean, is it possible to take subsamples from data that is kept on disk rather than in RAM, and to do that many times to build the iForest? And about the datasets: not all of them end up at 100%, but on many of them the methods are competitive, as far as I can see. I was worried about their size. – SANN Jul 24 '18 at 19:02
  • And also about your comment that coordinate data is not big enough after preprocessing: I didn't get the point. What happens in the preprocessing step? Actually, I don't know much about distributed frameworks, as I was forbidden to use them and had to solve the problem on a single machine. Anyway! And about the dimensionality you mentioned: I guess you agree that a static dataset with fixed cardinality and dimensionality (and an extremely large number of data points) could be considered Big Data? Is that correct? Thanks a lot ... – SANN Jul 24 '18 at 19:02
  • No, I say that real point data never is that "big". Don't use synthetic data, it's meaningless. – Has QUIT--Anony-Mousse Jul 25 '18 at 05:07
  • An outlier is not always a "villain". Outliers usually have characteristics different from normal instances and often indicate valuable information and knowledge in a data set, which is why outlier detection plays an important role in various applications. It is therefore important to justify whether an outlier is indeed an information nugget or an error in data collection. Once this distinction is clearly made, further analysis can be done. The point here is: if you did not collect the data yourself and there are outliers in the dataset, then you need a very strong reason to first prove that a point is an outlier before doing any outlier treatment.

  • As long as the dataset can fit in the secondary memory of a single standalone computer, it is NOT big data. Random Access Memory (RAM) is for processing, not storing. By this reasoning, real-time weather-sensor data can be considered big because it is real time, even though the sensors may be recording just a few dimensions, like air quality. Another example is the tweets on Twitter.

mnm
  • I appreciate your response @Ashish, but it doesn't make sense, at least to me, that big data is something that cannot be stored in secondary memory rather than in main memory (RAM)!? I mean, that data is read, e.g., from a database, using a framework or some other way. If a dataset cannot be stored in secondary memory, then where is it ultimately stored? That's the question your answer raises. After all, there are many algorithms that need the whole dataset to be loaded into RAM first, plus some extra space for processing, so it's a big deal to have an algorithm that doesn't. – SANN Jul 26 '18 at 08:43
  • @SANN it seems you've misunderstood my response. I urge you to read carefully what I've written! We all know primary memory is RAM and secondary memory is the HDD (I see the pain point; I've updated my answer). Do you agree with this or not? Now, coming to the concept of `big data`: it refers to satisfying the `3Vs`, `volume`, `velocity` and `variety`. A dataset has to satisfy these 3Vs, else it's not big data. Perhaps you are confusing this with the `frameworks for big data` like Hadoop, which is essentially a cluster-based framework. – mnm Jul 26 '18 at 09:11
  • Now your answer seems much more reasonable @Ashish! Sorry, I was confused by the earlier version. But about the 3Vs: the second V, `velocity`, implies that the data should be a `stream`, not `static`, am I right? If that is true, there is no way a static dataset with a gigantic volume could be considered Big Data! Is that correct? (And if it is, all my efforts have been for nothing!) – SANN Jul 26 '18 at 09:38
  • @SANN no problem. Let me give you some honest advice: if all this effort is for a peer-reviewed academic journal, I urge you to rethink the paper, because if you do not, it will most probably be `rejected` by the reviewers. If, on the contrary, this exercise is for business, then it's okay. I say okay because, in my experience, most business stakeholders are only concerned with getting a solution that has something called "BIG DATA" in it. – mnm Jul 27 '18 at 02:30
  • I appreciate your advice @Ashish, but it's definitely for a peer-reviewed academic journal! Although I think I've found a solution of sorts: there are references that treat high-volume static data as **`Big Static Data`**, like [this](https://content.taylorfrancis.com/books/download?dac=C2016-0-01988-0&isbn=9781498797610&format=googlePreviewPdf) and [this one](http://library.utia.cas.cz/separaty/2014/AS/dedecius-0431085.pdf), so I'm going to change the title of the paper and add the term "static" to it. Let's see what happens! ;-) – SANN Jul 27 '18 at 04:24
  • @SANN I think the term `BIG` is one of the most overused and exploited terms of the current time. Researchers see it as an easy way to get published, while businesses see it as an easy way to wrap up `vague solutions` to sell. So if you are writing a technical paper and are unable to justify the analysis, it will be rejected in the first round itself. That's also why I suggested in my answer that "outliers are not always the villains". Anyway, best of luck. – mnm Jul 27 '18 at 04:48
  • You're definitely right about the `BIG` stuff @Ashish. But as I said, the new title will narrow the whole claim to a specific type of `Big Data`, and I just hope it can be considered at least partly valid! As for the analysis, I think I've done a good one. And about the `villainy` of outliers: I'm only trying to find them, as many works do, not to interpret them, as some rare others do, and I guess that's alright. Thanks again ... – SANN Jul 27 '18 at 05:22