Questions tagged [apache-spark-mllib]

MLlib is a low-level, RDD-based machine learning library for Apache Spark.

2241 questions
1
vote
1 answer

Spark MLlib collaborative filtering---how to view movie factors?

I am working through this tutorial: https://databricks-training.s3.amazonaws.com/movie-recommendation-with-mllib.html. How would one view the factors associated with each movie? In other words, how do I look at the model that has been trained?
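For the ALS model in that tutorial, the learned per-movie latent vectors are exposed by MLlib's `MatrixFactorizationModel` as `productFeatures`, an RDD of `(movieId, factorArray)` pairs that can be collected for inspection. The lookup itself amounts to a map from ID to factor vector; a Spark-free Python sketch of that final step (the IDs and factor values below are invented for illustration):

```python
# Simulated result of model.productFeatures.collectAsMap():
# each movie ID maps to its learned latent-factor vector.
# (IDs and values are made up for illustration.)
movie_factors = {
    1: [0.12, -0.45, 0.83],
    2: [0.67, 0.05, -0.21],
    3: [-0.30, 0.91, 0.14],
}

def factors_for(movie_id):
    """Return the latent factors for one movie, or None if it was unseen."""
    return movie_factors.get(movie_id)
```

With the real model, `model.productFeatures.lookup(movieId)` does the same per-ID retrieval without collecting everything to the driver.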
1
vote
1 answer

Spark 1.3.1 install failed in MLlib when I run make-distribution.sh in Ubuntu 14.04

Spark 1.3.1 install failed in MLlib when I run make-distribution.sh on Ubuntu 14.04. java -version: java version "1.7.0_80", Java(TM) SE Runtime Environment (build 1.7.0_80-b15), Java HotSpot(TM) 64-Bit Server VM (build 24.80-b11, mixed mode). Scala…
zt1983811
  • 1,011
  • 3
  • 14
  • 34
1
vote
1 answer

spark 1.2.0 mllib kmeans: Out Of Memory Error

I'm new to Spark, and I am using the KMeans algorithm to cluster a data set that is 484 MB in size with 213,104 dimensions. My code is as follows: val k = args(0).toInt val maxIter = args(1).toInt val model = new…
ifloating
  • 23
  • 3
1
vote
1 answer

Spark - Naive Bayes classifier value error

I have the following issue when training a Naive Bayes classifier. I'm getting this error: File "/home/juande/Desktop/spark-1.3.0-bin-hadoop2.4/python/pyspark/mllib/classification.py", line 372, in train return NaiveBayesModel(labels.toArray(),…
user3276768
  • 1,416
  • 3
  • 18
  • 28
1
vote
1 answer

Spark - MLlib linear regression intercept and weight NaN

I have been trying to build a regression model on Spark using some custom data, and the intercept and weights are always NaN. This is my data: data = [LabeledPoint(0.0, [27022.0]), LabeledPoint(1.0, [27077.0]), LabeledPoint(2.0, [27327.0]),…
user3276768
  • 1,416
  • 3
  • 18
  • 28
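A common cause of NaN weights with `LinearRegressionWithSGD` is that raw feature values this large (tens of thousands) make the default SGD step size diverge; standardizing the features first usually fixes it. MLlib's `StandardScaler` does this on RDDs; a minimal Spark-free sketch of the same standardization in plain Python, using the feature values from the question:

```python
import math

def standardize(values):
    """Scale a feature column to zero mean and unit (population) variance."""
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / n
    std = math.sqrt(var)
    return [(v - mean) / std for v in values]

# Raw feature values from the question's LabeledPoints
features = [27022.0, 27077.0, 27327.0]
scaled = standardize(features)
```

After scaling, gradient descent steps stay in a numerically sane range, so the learned intercept and weights come out finite.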
1
vote
2 answers

Spark: Read CSV file with headers

I have a CSV file with 90 columns and around 28,000 rows. I want to load it and split it into train (75%) and test (25%) sets. I used the following code: Code: val data = sc.textFile(datadir + "/dados_frontwave_corte_pedra_ferramenta.csv") .map(line…
Mohammad
  • 1,006
  • 2
  • 15
  • 29
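The two pieces this question needs are dropping the header row and a 75/25 split; in Spark the split itself is typically `data.randomSplit(Array(0.75, 0.25))` after filtering out the header. Neither step depends on Spark specifics, so here is a plain-Python sketch of the logic with a seeded RNG so the split is reproducible:

```python
import random

def split_csv(lines, train_fraction=0.75, seed=42):
    """Drop the header line, then randomly split the rows into train/test."""
    header, rows = lines[0], lines[1:]
    rng = random.Random(seed)          # fixed seed for reproducibility
    train, test = [], []
    for row in rows:
        (train if rng.random() < train_fraction else test).append(row)
    return header, train, test

lines = ["col1,col2", "1,a", "2,b", "3,c", "4,d"]
header, train, test = split_csv(lines)
```

On small inputs the realized fractions can deviate noticeably from 75/25; `randomSplit` has the same property, since each row is assigned independently.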
1
vote
2 answers

Spark Categorical Data Encoding

Is there a function in Spark to do categorical data encoding? Ex:

Var1,Var2,Var3
1,2,a
2,3,b
3,2,c

to

var1,var2,var3
1,2,0
2,3,1
3,2,2

with a -> 0, b -> 1, c -> 2
Joel
  • 1,650
  • 2
  • 22
  • 34
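The RDD-based MLlib API of that era had no built-in categorical encoder, but the mapping is easy to build by hand: collect the distinct values of the column and index them (later Spark versions provide `StringIndexer` for DataFrames). A plain-Python sketch of that indexing, reproducing the question's example:

```python
def encode_column(values):
    """Map each distinct string to an integer index, in sorted order."""
    mapping = {v: i for i, v in enumerate(sorted(set(values)))}
    return [mapping[v] for v in values], mapping

# Var3 column from the question's example
var3 = ["a", "b", "c"]
encoded, mapping = encode_column(var3)
# a -> 0, b -> 1, c -> 2, matching the desired output
```

Sorting makes the assignment deterministic; with an RDD, the same mapping would come from `column.distinct().collect()` broadcast to the workers.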
1
vote
2 answers

Categorical Variables in Apache Spark using MLlib

I am relatively new to the world of Apache Spark. I am trying to estimate a large-scale model using LinearRegressionWithSGD(), where I would like to estimate fixed effects and interaction terms without having to create a huge design matrix. I noticed…
1
vote
1 answer

Spark - Prediction.io - scala.MatchError: null

I'm working on a template for prediction.io and I'm running into trouble with Spark. I keep getting a scala.MatchError error: full gist here scala.MatchError: null at org.apache.spark.rdd.PairRDDFunctions.lookup(PairRDDFunctions.scala:831) at…
1
vote
1 answer

spark mllib memory error on svd (single machine)

I have a large data file (around 4 GB) and I am analyzing it using Spark on a single PC. scala> x res29: org.apache.spark.mllib.linalg.distributed.RowMatrix = org.apache.spark.mllib.linalg.distributed.RowMatrix@5a86096a scala> x.numRows res27:…
Donbeo
  • 17,067
  • 37
  • 114
  • 188
1
vote
0 answers

Spark MLlib logs deprecated properties

I followed the training from Databricks. It runs on Azure and has been built with this configuration: build.sbt import AssemblyKeys._ assemblySettings name := "movielens-als" version := "0.1" scalaVersion := "2.11.4" libraryDependencies +=…
erwineberhard
  • 309
  • 4
  • 17
1
vote
1 answer

Runtime error in Scala: NoSuchMethodError

I am trying to use Spark MLlib algorithms in Scala in Eclipse. There are no problems during compilation, but at runtime there is an error saying "NoSuchMethodError". Here is my code: import org.apache.spark.SparkConf import…
Jack Daniel
  • 2,527
  • 3
  • 31
  • 52
1
vote
0 answers

Use of similarity function and RowMatrix in apache spark

I need to compute the similarity between the average vector computed from a RowMatrix and all vectors inside the same RowMatrix. To compute the average vector I am doing this (example in Java): RowMatrix matrix = new RowMatrix(vectorOfUserToItems.rdd()); Vector…
Adrian
  • 71
  • 1
  • 12
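The average vector of a RowMatrix can be read off `computeColumnSummaryStatistics().mean`; the remaining piece is a similarity measure between that mean and each row. A Spark-free sketch in plain Python, using cosine similarity (the choice of cosine is an assumption here, since the question does not name a metric):

```python
import math

def cosine(a, b):
    """Cosine similarity between two dense vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

rows = [[1.0, 0.0], [0.0, 1.0]]
mean = [sum(col) / len(rows) for col in zip(*rows)]  # column-wise average
sims = [cosine(row, mean) for row in rows]
```

On a real RowMatrix, the per-row computation would run as a `map` over `matrix.rows` with the mean vector broadcast to the workers.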
1
vote
1 answer

Use of foreachActive for spark Vector in Java

How do I write simple Java code that iterates over the active elements in a sparse vector? Let's say we have such a Vector: Vector sv = Vectors.sparse(3, new int[] {0, 2}, new double[] {1.0, 3.0}); I was trying with a lambda or Function2 (from three…
Adrian
  • 71
  • 1
  • 12
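`Vector.foreachActive` invokes its function once per explicitly stored (index, value) pair; from Java the Scala function type erases to `Function2<Object, Object, BoxedUnit>`, which is what makes lambdas awkward there. The iteration itself is trivial; a plain-Python sketch over the same sparse layout (parallel index/value arrays) as the question's `Vectors.sparse(3, ...)` example:

```python
def foreach_active(size, indices, values, f):
    """Call f(index, value) once for each explicitly stored entry."""
    for i, v in zip(indices, values):
        f(i, v)

# Mirrors Vectors.sparse(3, new int[] {0, 2}, new double[] {1.0, 3.0})
seen = []
foreach_active(3, [0, 2], [1.0, 3.0], lambda i, v: seen.append((i, v)))
```

Zero entries at the unlisted indices (here index 1) are never visited, which is the point of iterating only the active elements.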
1
vote
0 answers

Requested array size exceeds VM limit in MLlib Random Forest

I'm using MLlib to train a random forest. It works fine up to depth 15, but at depth 20 I get java.lang.OutOfMemoryError: Requested array size exceeds VM limit on the driver, from the collectAsMap operation in DecisionTree.scala, around…