Questions tagged [apache-spark-ml]

Spark ML is a high-level API for building machine learning pipelines in Apache Spark.

925 questions
0
votes
1 answer

Finding items which are similar

I have a large database of items from a retail company. If I want to find the items that are similar to a particular item, can I use Pearson correlation in Spark ML to do that? Is there a better algorithm for this? How do I make sure…
passionate
  • 503
  • 2
  • 7
  • 25
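A minimal sketch of the Pearson route the question asks about, using mllib's Statistics.corr on an RDD of observation vectors (the layout and sample values below are assumptions, not from the question):

    import org.apache.spark.mllib.linalg.Vectors
    import org.apache.spark.mllib.stat.Statistics

    // Each row is one observation (e.g. one customer), each column one item.
    val observations = sc.parallelize(Seq(
      Vectors.dense(1.0, 0.0, 3.0),
      Vectors.dense(2.0, 1.0, 0.0),
      Vectors.dense(4.0, 2.0, 1.0)
    ))

    // Entry (i, j) of the result is the Pearson correlation between item i and item j.
    val corrMatrix = Statistics.corr(observations, "pearson")
    println(corrMatrix)

Cosine similarity via RowMatrix.columnSimilarities is another common choice for item-to-item similarity at retail scale.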
0
votes
1 answer

Using CategoricalFeaturesInfo with DecisionTreeClassifier method in Spark

I have to use this code: val dt = new DecisionTreeClassifier().setLabelCol("indexedLabel").setFeaturesCol("indexedFeatures").setImpurity(impurity).setMaxBins(maxBins).setMaxDepth(maxDepth); I need to add categorical feature information so that the…
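In spark.ml the categoricalFeaturesInfo map from mllib is replaced by column metadata, usually produced by VectorIndexer. A hedged sketch, reusing the names and variables from the question and assuming a "features" vector column:

    import org.apache.spark.ml.Pipeline
    import org.apache.spark.ml.classification.DecisionTreeClassifier
    import org.apache.spark.ml.feature.VectorIndexer

    // Columns with at most maxCategories distinct values are flagged as
    // categorical in the metadata that DecisionTreeClassifier reads.
    val featureIndexer = new VectorIndexer()
      .setInputCol("features")
      .setOutputCol("indexedFeatures")
      .setMaxCategories(4)

    val dt = new DecisionTreeClassifier()
      .setLabelCol("indexedLabel")
      .setFeaturesCol("indexedFeatures")
      .setImpurity(impurity)
      .setMaxBins(maxBins)
      .setMaxDepth(maxDepth)

    val pipeline = new Pipeline().setStages(Array(featureIndexer, dt))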
0
votes
1 answer

Using MatrixUDT as column in SparkSQL Dataframe

I'm trying to load a set of medical images into a Spark SQL DataFrame, where each image is loaded into a matrix column. I see Spark recently added MatrixUDT to support this kind of case, but I can't find a sample of using it in a DataFrame.…
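A sketch of one way to get a matrix column, assuming the Matrix type in your Spark version carries the MatrixUDT annotation so createDataFrame maps it automatically (sqlContext, the ids, and the column names are placeholders):

    import org.apache.spark.mllib.linalg.{Matrices, Matrix}

    // Each tuple becomes a row: an id plus the image's pixel matrix.
    val images: Seq[(String, Matrix)] = Seq(
      ("img-001", Matrices.dense(2, 2, Array(0.0, 1.0, 2.0, 3.0)))
    )
    val df = sqlContext.createDataFrame(images).toDF("imageId", "pixels")
    df.printSchema()  // the pixels column should show up with the matrix UDT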
0
votes
1 answer

Got OutOfMemory when running Spark MLlib KMeans

I always get an OutOfMemory error when I run Spark KMeans on a big data set. The training set is about 250 GB, and I have a 10-node Spark cluster, each machine with 16 CPUs and 150 GB of memory. I give the job 100 GB of memory on each node and 50 CPUs in total. I set the…
Jack
  • 5,540
  • 13
  • 65
  • 113
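Without the full configuration it is hard to say exactly what runs out of memory, but one hedged direction is smaller executors plus more input partitions so each task holds a smaller slice of the 250 GB set (all values below are placeholders illustrating the shape, not recommendations):

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setAppName("kmeans-training")
      .set("spark.executor.memory", "10g")   // several smaller executors per node
      .set("spark.executor.cores", "2")
    val sc = new SparkContext(conf)

    // More partitions => smaller per-task working set during the KMeans iterations.
    val data = sc.textFile("hdfs:///path/to/training")  // hypothetical path
      .repartition(2000)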
0
votes
1 answer

spark-ml naive bayes save to hdfs

I know that with spark-mllib we can save a Naive Bayes model to HDFS with the save() method. But when I try to save a spark-ml Naive Bayes model to HDFS, it gives the error: Wrong FS: hdfs://localhost:8020/pa/model/nb, expected: file:/// I am using…
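The "Wrong FS ... expected: file:///" message usually means the Hadoop configuration still points at the local filesystem. A hedged sketch, assuming a training DataFrame named training with "label"/"features" columns and a namenode at localhost:8020 as in the error:

    import org.apache.spark.ml.classification.NaiveBayes

    // Make sure the default filesystem matches the hdfs:// URI being written to.
    sc.hadoopConfiguration.set("fs.defaultFS", "hdfs://localhost:8020")

    val model = new NaiveBayes().fit(training)

    // spark.ml models persist through the MLWriter API rather than mllib's save(sc, path).
    model.write.overwrite().save("hdfs://localhost:8020/pa/model/nb")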
0
votes
1 answer

How does Spark MLlib deal with a Java program?

I was wondering how Spark deals with a Java program calling machine learning algorithms provided by MLlib. Do I need to download the Spark Project ML Library? Also, where is the source code of MLlib for the Java API? I can't find it in its…
Hereme
  • 193
  • 1
  • 1
  • 5
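The Java API ships in the same spark-mllib artifact as the Scala one, so nothing separate needs downloading beyond the dependency itself. A hedged build.sbt sketch (the 1.6.1 version is only an example):

    // build.sbt
    libraryDependencies ++= Seq(
      "org.apache.spark" %% "spark-core"  % "1.6.1" % "provided",
      "org.apache.spark" %% "spark-mllib" % "1.6.1" % "provided"
    )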
0
votes
1 answer

Spark ML Word2Vec Serialization Issues

Spark Version: 1.6.1. I have recently refactored our Word2Vec code to move to DataFrame-based ml models, but I am having problems serializing and loading the model locally. I am able to successfully fit the DataFrame and create the…
skgemini
  • 600
  • 4
  • 7
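A minimal save/load sketch, assuming a docs DataFrame with a "text" column of Seq[String] and a Spark build in which Word2VecModel implements the MLWritable/MLReadable persistence API:

    import org.apache.spark.ml.feature.{Word2Vec, Word2VecModel}

    val word2Vec = new Word2Vec()
      .setInputCol("text")
      .setOutputCol("vectors")
      .setVectorSize(100)
      .setMinCount(5)

    val model = word2Vec.fit(docs)

    // Persist and reload; the path is a hypothetical local directory.
    model.write.overwrite().save("/tmp/w2v-model")
    val reloaded = Word2VecModel.load("/tmp/w2v-model")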
0
votes
1 answer

Is it inefficient to manually iterate Spark SQL data frames and create column values?

In order to run a few ML algorithms, I need to create extra columns of data. Each of these columns involves some fairly intensive calculations that involve keeping moving averages and recording information as you go through each row (and updating it…
Eric Staner
  • 969
  • 2
  • 9
  • 14
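A hedged alternative to hand-rolled row iteration: express the moving average as a window function so Spark plans it instead of the driver walking rows (the ts and value column names are assumptions):

    import org.apache.spark.sql.expressions.Window
    import org.apache.spark.sql.functions.avg

    // Average of the current row and the nine preceding rows, ordered by time.
    val movingAvg = Window.orderBy("ts").rowsBetween(-9, 0)
    val withAvg = df.withColumn("value_ma10", avg("value").over(movingAvg))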
0
votes
1 answer

Training a spark ml linear regression model fails after migrating to 1.6.1

I use spark-ml to train a linear regression model. It worked perfectly with spark version 1.5.2, but now with 1.6.1 I get the following error: java.lang.AssertionError: assertion failed: lapack.dppsv returned 228. It seems to be related to some…
philippe
  • 121
  • 1
  • 6
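lapack.dppsv is only invoked by the normal-equation solver that 1.6 can pick automatically, so one hedged workaround is to force L-BFGS if the solver param is available in your build (training is an assumed DataFrame):

    import org.apache.spark.ml.regression.LinearRegression

    val lr = new LinearRegression()
      .setLabelCol("label")
      .setFeaturesCol("features")
      .setSolver("l-bfgs")   // avoid the normal-equation path that calls dppsv

    val model = lr.fit(training)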
0
votes
1 answer

sc.parallelize not working in the ML pipeline with the training algorithm

With org.apache.spark.mllib learning algorithms, we used to set up the pipeline without the training algorithm: var stages: Array[org.apache.spark.ml.PipelineStage] = index_transformers :+ assembler; val pipeline = new Pipeline().setStages(stages) and…
Abhishek
  • 3,337
  • 4
  • 32
  • 51
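A minimal sketch of appending the estimator itself as the final pipeline stage, reusing the index_transformers and assembler values from the question and an assumed trainingDf:

    import org.apache.spark.ml.Pipeline
    import org.apache.spark.ml.classification.LogisticRegression

    val lr = new LogisticRegression()
      .setLabelCol("label")
      .setFeaturesCol("features")

    // The estimator is just another PipelineStage, so fit() runs the
    // transformers first and then trains the model.
    val stages: Array[org.apache.spark.ml.PipelineStage] =
      index_transformers :+ assembler :+ lr
    val pipeline = new Pipeline().setStages(stages)
    val model = pipeline.fit(trainingDf)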
0
votes
1 answer

How to avoid hardcoding in column selection in data frame in apache spark | Scala

I have the following data frame and I need to run logistic regression using spark ml on it:
uid  a  b  c  label  d
1    0  1  3  0      2
2    3  0  0  1      0
While using the ml package, I came to know that I need to create the data in the…
hbabbar
  • 947
  • 4
  • 15
  • 33
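A hedged sketch of deriving the feature columns from the schema instead of hardcoding them, excluding the id and label columns by name:

    import org.apache.spark.ml.feature.VectorAssembler

    // Everything except uid and label becomes a feature column.
    val featureCols = df.columns.filterNot(Set("uid", "label"))

    val assembler = new VectorAssembler()
      .setInputCols(featureCols)
      .setOutputCol("features")
    val assembled = assembler.transform(df)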
0
votes
1 answer

Error with RDD[Vector] in function parameter

I am trying to define a function in Scala to iterate on it with Spark. Here is my code: import org.apache.spark.{SparkConf, SparkContext} import org.apache.spark.sql.SQLContext import org.apache.spark.ml.{Pipeline, PipelineModel} import…
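The usual trap here is mixing Vector types: the parameter must use the same Vector class the RDD actually holds (mllib.linalg.Vector in 1.x, not scala.collection.immutable.Vector). A minimal sketch with a hypothetical summarizing function:

    import org.apache.spark.mllib.linalg.Vector
    import org.apache.spark.rdd.RDD

    // Explicitly typed against Spark's Vector so the call site resolves cleanly.
    def summarize(data: RDD[Vector]): Long = {
      data.cache()
      data.count()
    }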
0
votes
0 answers

Spark ML, parameter for "rawPredictionCol" for Binary Classification

I want to use the binary classification evaluator in spark.ml to evaluate my model after my Pipeline. I use this code: val gbt = new GBTClassifier() .setLabelCol("Label_Index") .setFeaturesCol("features") .setMaxIter(10) .setMaxDepth(7) …
pierre_comalada
  • 300
  • 3
  • 11
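A hedged sketch of wiring the evaluator to the column the classifier actually emits; GBTClassifier in older releases produces only "prediction", so that column is pointed at here (the predictions DataFrame is assumed to come from the fitted pipeline):

    import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator

    val evaluator = new BinaryClassificationEvaluator()
      .setLabelCol("Label_Index")
      .setRawPredictionCol("prediction")   // the evaluator accepts a double or vector column
      .setMetricName("areaUnderROC")

    val auc = evaluator.evaluate(predictions)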
0
votes
1 answer

Why does LogisticRegressionModel fail at scoring of libsvm data?

Load the data that you want to score. The data is stored in libsvm format in the following manner: label index1:value1 index2:value2 ... (the indices are one-based and in ascending order). Here is the sample data: 100 10:1 11:1 208:1 400:1…
user3803714
  • 5,269
  • 10
  • 42
  • 61
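One common cause is a feature-dimension mismatch between the training and scoring sets; a hedged sketch that pins the dimension when loading the libsvm file (the path and the 500 are placeholders):

    // Load the scoring set with the same vector size the model was trained on.
    val scoring = sqlContext.read
      .format("libsvm")
      .option("numFeatures", "500")
      .load("data/score.libsvm")

    val scored = lrModel.transform(scoring)   // lrModel: the trained LogisticRegressionModel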
0
votes
1 answer

Overwriting ML model in S3 bucket

I am saving an ML model to an S3 bucket. After a long search this thread helped me find a solution. My code looks as follows: sc.parallelize(Seq(model), 1).saveAsObjectFile("s3a://bucket/nameModel.model") The first time I run this job everything…
RudyVerboven
  • 1,204
  • 1
  • 14
  • 31
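saveAsObjectFile refuses to write over an existing path, so the second run needs the old output removed first. A hedged sketch using the Hadoop FileSystem API with the bucket path from the question:

    import java.net.URI
    import org.apache.hadoop.fs.{FileSystem, Path}

    val output = "s3a://bucket/nameModel.model"

    // Recursive delete; returns false (and does nothing) if the path is absent.
    val fs = FileSystem.get(new URI(output), sc.hadoopConfiguration)
    fs.delete(new Path(output), true)

    sc.parallelize(Seq(model), 1).saveAsObjectFile(output)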