I'm trying to extend, or proxy, the org.apache.spark.ml.clustering.KMeans class so that K=1 is allowed.

class K1Means extends Estimator{

    final val kmeans = new KMeans()
    val k = 1

    override def setK(value:Int) {
        if(value >1){
            this.kmeans.setK(value)
        }
    }

    override def fit(dataset: DataFrame): KMeansModel = { 
        if(this.k == 1){
            /* super specific to my case */
            val avg_sample = Vectors.zeros(
                dataset
                .select("scaledFeatures")
                .take(1)(0)(0)  // first row
                .asInstanceOf[DenseVector]  // was of type Any
                .size
            ) // with the scaling the average value of each column is 0
            var centers_local = Array(avg_sample)
            return new KMeansModel(centers_local)
        }
        else{
            return this.kmeans.fit(dataset)
        }
    }
// every method then calls this.kmeans.method()
}

I've tried this, but new KMeansModel(centers_local) is not allowed, since KMeansModel has a private constructor. Here is the error message:

constructor KMeansModel in class KMeansModel cannot be accessed in class K1Means

I also tried to extend KMeansModel so that I can create my own model and return it:

class K1MeansModel(centers: Array[DenseVector]) extends KMeansModel{}

But it also fails: constructor KMeansModel in class KMeansModel cannot be accessed in class K1MeansModel

Borbag
  • The docs seem to disagree with you: https://spark.apache.org/docs/1.6.0/api/java/org/apache/spark/ml/clustering/KMeansModel.html looks public to me – Meir Maor Jun 21 '16 at 10:10
  • Can you edit your question and provide the actual error message? – The Archetypal Paul Jun 21 '16 at 10:18
  • OK, I'll have to rephrase. Saying that the constructor is private is probably the correct way to put it: the class can only be instantiated by KMeans. – Borbag Jun 21 '16 at 10:20

1 Answer

There are several problems here, starting with the KMeansModel constructor being private[ml] (the class itself is public, but only code inside Spark's ml package can construct it): https://github.com/apache/spark/blob/4f83ca1059a3b580fca3f006974ff5ac4d5212a1/mllib/src/main/scala/org/apache/spark/ml/clustering/KMeans.scala#L102

Why is this a problem? You could totally write your own proxy in the way you proposed, but in order to override the "fit" method, the data type returned by that function needs to be a KMeansModel or compatible (let's say "K1MeansModel"), like this:

class K1MeansModel extends KMeansModel{
    // ...
}

class K1Means extends KMeans{

    final val kmeans = new KMeans()
    // ...

    override def fit(dataset: DataFrame): KMeansModel = { 
        if(this.k == 1){
            // ...
            return new K1MeansModel(centers_local)
        }
        else{
            return this.kmeans.fit(dataset)
        }
    }
}

But yeah, because the KMeansModel constructor is private[ml], that's not possible: a subclass has to call the superclass constructor, and it can't. So you might think, "why not reimplement it?" Indeed, you could just copy and paste the whole code of KMeansModel from GitHub.

The definition of KMeansModel looks like this:

class KMeansModel private[ml] (
        override val uid: String, 
        private val parentModel: MLlibKMeansModel) 
    extends Model[KMeansModel] with KMeansParams { }

But yeah, because KMeansParams is private[clustering], that's not possible either. So you might think, "why not reimplement it?" Indeed, you could just copy and paste the whole code of KMeansParams from GitHub.

The definition of KMeansParams looks like this:

private[clustering] trait KMeansParams 
    extends Params 
        with HasMaxIter 
        with HasFeaturesCol 
        with HasSeed 
        with HasPredictionCol 
        with HasTol { }

But yeah, because HasMaxIter, HasFeaturesCol, HasSeed, HasPredictionCol and HasTol are all private[ml], that's not possible. ... You get the idea.
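
The access rule behind each of these dead ends can be reproduced without Spark. Below is a minimal sketch using stand-in names (ml, Model, Trainer and Client are hypothetical, not Spark classes), with an enclosing object playing the role of the org.apache.spark.ml package. It only illustrates how a qualified-private constructor behaves; it is not Spark's actual code.

```scala
// Stand-in for the org.apache.spark.ml package (hypothetical names throughout).
object ml {
  // Public class, but the constructor is visible only inside `ml`,
  // mirroring `class KMeansModel private[ml] (...)`.
  class Model private[ml] (val centers: Array[Array[Double]])

  // Stand-in for KMeans: it lives inside `ml`, so it may call the constructor.
  class Trainer {
    def fit(): Model = new Model(Array(Array(0.0, 0.0)))
  }
}

// Stand-in for user code living outside the package.
object Client {
  import ml.{Model, Trainer}

  // Allowed: the type is public, so a Model can be received and used.
  def fitted(): Model = new Trainer().fit()

  // Not allowed: uncommenting either line gives a "cannot be accessed"
  // compile error like the one quoted in the question.
  // val direct = new Model(Array(Array(0.0)))
  // class MyModel extends Model(Array(Array(0.0)))

  def main(args: Array[String]): Unit =
    println(fitted().centers.length)
}
```

This is why both attempts in the question fail in the same way: calling new and declaring a subclass each need access to the constructor, and only code inside the enclosing scope has it.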


TL;DR: yes, you could reimplement (copy and paste) a ton of Spark classes into your project just to override KMeans. I count at least 7 classes that would need copying. To me that feels shitty. Instead, I'd recommend adding the code directly to Apache Spark: fork the Spark GitHub repo, add your K=1 handling to the ml.KMeans class, and be done with it.

Florian Golemo