Questions tagged [apache-spark-mllib]

MLlib is a machine learning library for Apache Spark

MLlib is a low-level, RDD-based machine learning library for Apache Spark.

2241 questions
10 votes • 1 answer

How to convert org.apache.spark.rdd.RDD[Array[Double]] to Array[Double] which is required by Spark MLlib

I am trying to implement KMeans using Apache Spark. val data = sc.textFile(irisDatasetString) val parsedData = data.map(_.split(',').map(_.toDouble)).cache() val clusters = KMeans.train(parsedData,3,numIterations = 20) on which I get the following…
sand • 137
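The error here is typically a type mismatch: `KMeans.train` expects an RDD whose elements are MLlib `Vector`s (in Scala, wrap each parsed array with `Vectors.dense`), not a raw `Array[Double]` or a collected result. A minimal pure-Python stand-in for the per-line parsing step (`parse_line` is a hypothetical helper, not a Spark API):

```python
def parse_line(line):
    """Split one CSV line of the iris data into a list of floats --
    the numeric vector that goes into one RDD element for KMeans."""
    return [float(x) for x in line.strip().split(",")]

rows = ["5.1,3.5,1.4,0.2", "4.9,3.0,1.4,0.2"]
parsed = [parse_line(r) for r in rows]  # one vector per input line
```

In Scala the equivalent map is `data.map(_.split(',').map(_.toDouble)).map(Vectors.dense(_))`, which gives the `RDD[Vector]` that `KMeans.train` accepts.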
10 votes • 2 answers

Mllib dependency error

I'm trying to build a very simple Scala standalone app using MLlib, but I get the following error when trying to build the program: Object Mllib is not a member of package org.apache.spark Then, I realized that I have to add Mllib as dependency…
user3789843 • 1,009
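For reference, the usual fix is declaring spark-mllib in the build definition alongside spark-core; a sketch of the relevant `build.sbt` lines (version numbers are placeholders — match them to the installed Spark):

```scala
// build.sbt — spark-core alone does not pull in MLlib
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core"  % "3.5.0" % "provided",
  "org.apache.spark" %% "spark-mllib" % "3.5.0" % "provided"
)
```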
10 votes • 3 answers

How do I run the Spark decision tree with a categorical feature set using Scala?

I have a feature set with a corresponding categoricalFeaturesInfo: Map[Int,Int]. However, for the life of me I cannot figure out how I am supposed to get the DecisionTree class to work. It will not accept anything but a LabeledPoint as data.…
9 votes • 3 answers

Spark v3.0.0 - WARN DAGScheduler: broadcasting large task binary with size xx

I'm new to Spark. I'm coding a machine learning algorithm in Spark standalone (v3.0.0) with these configuration settings: SparkConf conf = new SparkConf(); conf.setMaster("local[*]"); conf.set("spark.driver.memory",…
vittoema96 • 121
9 votes • 1 answer

What Type should the dense vector be, when using UDF function in Pyspark?

I want to change List to Vector in PySpark, and then feed this column to a machine learning model for training. But my Spark version is 1.6.0, which does not have VectorUDT(). So what type should I return in my udf function? from pyspark.sql import…
9 votes • 1 answer

Vector assembler in Pyspark is creating tuple of multiple vectors instead of a single vector, how to solve the issue?

My python version is 3.6.3 and spark version is 2.2.1. Here is my code: from pyspark.ml.linalg import Vectors from pyspark.ml.feature import VectorAssembler from pyspark import SparkContext, SparkConf from pyspark.sql import SparkSession sc =…
Mir Md Faysal • 477
9 votes • 2 answers

Comparing two arrays and getting the difference in PySpark

I have two array fields in a data frame. I have a requirement to compare these two arrays and get the difference as an array (new column) in the same data frame. Expected output is: Column B is a subset of column A. Also the words are going to be in…
jiks-hue • 139
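On Spark 2.4+ this is exactly what `pyspark.sql.functions.array_except(colA, colB)` computes (elements of A not in B; note it also de-duplicates the result). On older versions, a UDF with roughly this body works — a minimal pure-Python sketch of the logic, where `array_diff` is a hypothetical helper:

```python
def array_diff(a, b):
    """Elements of a that do not appear in b, keeping a's order.
    (Unlike Spark's array_except, duplicates in a are preserved.)"""
    removed = set(b)
    return [x for x in a if x not in removed]

result = array_diff(["one", "two", "three"], ["two"])  # ["one", "three"]
```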
9 votes • 1 answer

How to extract vocabulary from Pipeline

I can extract the vocabulary from a CountVectorizerModel in the following way: fl = StopWordsRemover(inputCol="words", outputCol="filtered") df = fl.transform(df) cv = CountVectorizer(inputCol="filtered", outputCol="rawFeatures") model =…
user2377528
9 votes • 2 answers

How can I evaluate the implicit feedback ALS algorithm for recommendations in Apache Spark?

How can you evaluate the implicit feedback collaborative filtering algorithm of Apache Spark, given that the implicit "ratings" can vary from zero to anything, so a simple MSE or RMSE does not have much meaning?
Dimitris Poulopoulos • 1,139
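One common answer from the implicit-feedback ALS literature is to use a rank-based metric instead of MSE/RMSE: check where the items a user actually consumed land in the model's ranked recommendation list. A minimal sketch of mean percentile ranking (the data shapes and helper name are assumptions for illustration):

```python
def mean_percentile_rank(scores_by_user, held_out):
    """Mean percentile ranking for implicit feedback: for each user,
    where does the held-out item land in the model's ranked item list?
    0.0 = always ranked first (best), ~0.5 = no better than random."""
    ranks = []
    for user, item in held_out.items():
        scores = scores_by_user[user]
        ranked = sorted(scores, key=scores.get, reverse=True)
        ranks.append(ranked.index(item) / (len(ranked) - 1))
    return sum(ranks) / len(ranks)

# toy check: one perfect ranking + one worst-case ranking -> 0.5 overall
mpr = mean_percentile_rank(
    {"u": {"a": 3, "b": 2, "c": 1}, "v": {"a": 1, "b": 2, "c": 3}},
    {"u": "a", "v": "a"},
)
```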
9 votes • 0 answers

Non-linear SVM is not available in Apache Spark

Does anyone know the reason why the non-linear SVM has not been implemented in Apache Spark? I was reading this page: https://issues.apache.org/jira/browse/SPARK-4638 Look at the last comment. It says: "Commenting here b/c of the recent dev list…
Vitrion • 405
9 votes • 1 answer

How to do prediction with Sklearn Model inside Spark?

I have trained a model in Python using sklearn. How can we load the same model in Spark and generate predictions on a Spark RDD?
Tanveer • 890
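The usual pattern is to broadcast the fitted sklearn model to the executors and call `predict` inside `rdd.mapPartitions`, so the model is deserialized once per partition rather than once per row. A pure-Python sketch of that per-partition step (`DummyModel` and `predict_partition` are stand-ins for illustration; real code would use `sc.broadcast(model)` and `rdd.mapPartitions`):

```python
class DummyModel:
    # stand-in for a fitted sklearn estimator with a .predict method
    def predict(self, rows):
        return [sum(r) for r in rows]

def predict_partition(model, partition):
    """What runs inside rdd.mapPartitions: batch the rows of one
    partition into a single predict() call, yield the results."""
    rows = list(partition)
    return iter(model.predict(rows))

preds = list(predict_partition(DummyModel(), iter([[1, 2], [3, 4]])))
```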
9 votes • 2 answers

Online learning of LDA model in Spark

Is there a way to train an LDA model in an online-learning fashion, i.e. loading a previously trained model and updating it with new documents?
9 votes • 2 answers

(Spark) object {name} is not a member of package org.apache.spark.ml

I'm trying to run self-contained application using scala on apache spark based on example here: http://spark.apache.org/docs/latest/ml-pipeline.html Here's my complete code: import org.apache.spark.ml.classification.LogisticRegression import…
Yusata • 199
9 votes • 4 answers

How to create a Row from a List or Array in Spark using java

In Java, I use RowFactory.create() to create a Row: Row row = RowFactory.create(record.getLong(1), record.getInt(2), record.getString(3)); where "record" is a record from a database, but I cannot know the length of "record" in advance, so I want to…
user2736706 • 103
9 votes • 1 answer

Speed up collaborative filtering for large dataset in Spark MLLib

I'm using MLlib's matrix factorization to recommend items to users. I have a big implicit interaction matrix of about M=20 million users and N=50k items. After training the model I want to get a short list (e.g. 200) of recommendations for each user.…
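A common way to make this tractable is to score users in blocks against the item-factor matrix and keep only each user's top-k via a partial sort, instead of materializing and fully sorting all M×N scores (MLlib's `MatrixFactorizationModel.recommendProductsForUsers` batches this for you). A hedged pure-Python sketch of the per-block step (`topk_per_user` is a hypothetical helper; in practice this would be a BLAS matrix product):

```python
import heapq

def topk_per_user(user_factors, item_factors, k):
    """For each user factor row, dot-product against every item factor
    and keep only the indices of the k best items (partial sort via
    heapq.nlargest rather than sorting the full score list)."""
    recs = []
    for u in user_factors:
        scores = [sum(a * b for a, b in zip(u, v)) for v in item_factors]
        recs.append(heapq.nlargest(k, range(len(scores)), key=scores.__getitem__))
    return recs

# one user, three items: item 1 scores 2.0, item 2 scores 1.0, item 0 scores 0.0
recs = topk_per_user([[1, 0]], [[0, 1], [2, 0], [1, 0]], 2)
```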