I'm trying to run a Spark application on a standalone cluster. In this application I'm training a Naive Bayes classifier using TF-IDF vectors.

I wrote the application in a similar manner to this post (Spark MLLib TFIDF implementation for LogisticRegression). The main difference is that I take each document, tokenize it, and normalize it:

JavaRDD<Document> termDocsRdd = sc.wholeTextFiles("D:/fileFolder").flatMap(
    new FlatMapFunction<Tuple2<String, String>, Document>() {
        @Override
        public Iterable<Document> call(Tuple2<String, String> tup) {
            // tup._1() is the file path, tup._2() is the file content
            return Arrays.asList(parsingFunction(tup));
        }
    });

parsingFunction doesn't call any Spark operations such as map or flatMap, so it doesn't perform any data distribution itself.

My cluster consists of one master machine and two other machines acting as nodes. All machines have an 8-core CPU and 16 GB of RAM. I'm trying to train the classifier on 20 text files (each roughly 100 KB to 1.5 MB). I don't use a distributed filesystem; I put the files directly on the nodes.

The problem is that my cluster isn't as fast as I expected: training the classifier took about 5 minutes. In local mode the same operation took much less time.

What should I pay attention to?

I would appreciate any advice.

Thank you!

dimson

1 Answer

Did you cache the RDD for the training data? An iterative algorithm like training a Bayes classifier will perform poorly unless the RDD is cached.
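
To illustrate why caching matters, here is a minimal plain-Java sketch, not Spark code. It assumes a lazily evaluated pipeline that is re-run every time a result is requested, which is roughly how an uncached RDD behaves when several passes read it (a TF-IDF pipeline, for example, typically reads the term frequencies once to fit the IDF weights and again to transform them). The class `CacheSketch`, the `parse` helper and the `cached` wrapper are hypothetical names made up for this example; in Spark itself the equivalent step is calling `cache()` on the RDD before it is reused.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

public class CacheSketch {
    // Counts how often the expensive "parse" step actually runs.
    static final AtomicInteger parseCalls = new AtomicInteger();

    // Hypothetical stand-in for parsingFunction: lowercases and tokenizes a document.
    static List<String> parse(String doc) {
        parseCalls.incrementAndGet();
        List<String> tokens = new ArrayList<>();
        for (String t : doc.toLowerCase().split("\\s+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    // Wraps a supplier so its value is computed once and reused afterwards.
    static <T> Supplier<T> cached(final Supplier<T> source) {
        return new Supplier<T>() {
            private T value;
            private boolean computed;

            @Override
            public synchronized T get() {
                if (!computed) {
                    value = source.get();
                    computed = true;
                }
                return value;
            }
        };
    }

    // Runs `passes` passes over the lazy pipeline and returns how many times parse() ran.
    static int countParses(boolean useCache, int passes) {
        parseCalls.set(0);
        Supplier<List<String>> lazy = () -> parse("Spark MLlib TF IDF");
        Supplier<List<String>> pipeline = useCache ? cached(lazy) : lazy;
        for (int i = 0; i < passes; i++) {
            pipeline.get().size(); // each pass pulls the data again
        }
        return parseCalls.get();
    }

    public static void main(String[] args) {
        System.out.println("without cache: " + countParses(false, 3)); // prints 3
        System.out.println("with cache:    " + countParses(true, 3));  // prints 1
    }
}
```

Without the wrapper, every pass repeats the parsing work; with it, the work is done once. With real data the gain depends on how many passes the job makes and on whether the cached data fits in memory.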

Josh Milthorpe
  • I cache the RDD the same way as in this post [http://stackoverflow.com/questions/26897908/spark-mllib-tfidf-implementation-for-logisticregression]. Or should I perhaps cache every RDD in the training part of my application? Thank you! – dimson Dec 11 '14 at 18:19
  • I tried caching all RDDs and performance improved: MLlib's Bayes training time dropped from 3.5 minutes to 1.5 minutes. Do you think this is a good enough result for Spark? Training data: 30 text files (30 MB in total). Cluster: 1 master machine and three slave machines. Each machine has an 8-core CPU and 16 GB of RAM. Thank you! – dimson Dec 12 '14 at 12:33