Questions tagged [apache-spark-ml]

Spark ML is a high-level API for building machine learning pipelines in Apache Spark.

925 questions
17 votes, 1 answer

How to get word details from TF Vector RDD in Spark ML Lib?

I have created term frequencies using HashingTF in Spark and got the term frequencies using tf.transform for each word, but the results are displayed in this format: [,…
asked by Srini
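HashingTF is a one-way hash, so the original term behind a given vector index cannot be recovered from the TF vector alone. A minimal sketch of the usual workaround, swapping in CountVectorizer (which keeps its vocabulary); the column names and toy data below are assumptions, not from the question:

import org.apache.spark.sql.SparkSession
import org.apache.spark.ml.feature.CountVectorizer

val spark = SparkSession.builder().master("local[*]").appName("tf-terms").getOrCreate()

// Toy tokenized documents; in practice these would come from a Tokenizer.
val docs = spark.createDataFrame(Seq(
  (0, Seq("spark", "ml", "spark")),
  (1, Seq("hashing", "tf", "vector"))
)).toDF("id", "words")

// CountVectorizer keeps a vocabulary, so index i of the output vector
// corresponds to cvModel.vocabulary(i).
val cvModel = new CountVectorizer()
  .setInputCol("words")
  .setOutputCol("features")
  .fit(docs)

cvModel.vocabulary.zipWithIndex.foreach { case (term, idx) =>
  println(s"index $idx -> term '$term'")
}

With the RDD-based mllib HashingTF, hashingTF.indexOf(term) at least tells you which index a known term hashes to, but there is no reverse lookup.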
16 votes, 1 answer

Field "features" does not exist. SparkML

I am trying to build a model in Spark ML with Zeppelin. I am new to this area and would like some help. I think I need to set the correct datatypes for the columns and set the first column as the label. Any help would be greatly appreciated, thank…
asked by Young4844
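This error usually means the estimator was handed a DataFrame with no vector column named features. A hedged sketch of the usual fix, putting a VectorAssembler in front of the estimator; the column names f1, f2 and target are invented for illustration:

import org.apache.spark.sql.SparkSession
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.ml.classification.LogisticRegression

val spark = SparkSession.builder().master("local[*]").appName("features-col").getOrCreate()

// Toy data: two numeric feature columns and a target column.
val raw = spark.createDataFrame(Seq(
  (1.0, 0.5, 1.0),
  (0.2, 1.5, 0.0)
)).toDF("f1", "f2", "target")

// Assemble the raw numeric columns into the "features" vector the estimator expects.
val assembler = new VectorAssembler()
  .setInputCols(Array("f1", "f2"))
  .setOutputCol("features")

// Point the estimator at the label column explicitly instead of relying on a column named "label".
val lr = new LogisticRegression().setLabelCol("target")

val model = new Pipeline().setStages(Array(assembler, lr)).fit(raw)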
16 votes, 1 answer

How to convert ArrayType to DenseVector in PySpark DataFrame?

I'm getting the following error trying to build a ML Pipeline: pyspark.sql.utils.IllegalArgumentException: 'requirement failed: Column features must be of type org.apache.spark.ml.linalg.VectorUDT@3bfc3ba7 but was actually…
asked by Evan Zamir
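A common fix is to wrap the array column in an ml Vector with a small UDF before it reaches the pipeline. The question is PySpark, but the idea is the same; a Scala sketch, where the column name features_array is an assumption:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.udf
import org.apache.spark.ml.linalg.{Vector, Vectors}

val spark = SparkSession.builder().master("local[*]").appName("array-to-vector").getOrCreate()
import spark.implicits._

// Toy DataFrame with an ArrayType(DoubleType) column.
val df = Seq(
  (0, Seq(1.0, 2.0, 3.0)),
  (1, Seq(4.0, 5.0, 6.0))
).toDF("id", "features_array")

// Wrap the array in a DenseVector so the column becomes the VectorUDT the pipeline expects.
val toVector = udf { xs: Seq[Double] => Vectors.dense(xs.toArray): Vector }

val withFeatures = df.withColumn("features", toVector($"features_array"))
withFeatures.printSchema()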
16 votes, 1 answer

Spark ML indexer cannot resolve DataFrame column name with dots?

I have a DataFrame with a column named a.b. When I specify a.b as the input column name to a StringIndexer, an AnalysisException is thrown with the message "cannot resolve 'a.b' given input columns a.b". I'm using Spark 1.6.0. I'm aware that older versions of…
asked by Joshua Taylor
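A commonly suggested workaround on that version is to rename the column so the dot never reaches the analyzer, then index the renamed column. A small sketch with toy data:

import org.apache.spark.sql.SparkSession
import org.apache.spark.ml.feature.StringIndexer

val spark = SparkSession.builder().master("local[*]").appName("dotted-columns").getOrCreate()
import spark.implicits._

// Toy DataFrame whose column name contains a dot.
val df = Seq("x", "y", "x").toDF("a.b")

// Rename "a.b" to a dot-free name before indexing, then index as usual.
val renamed = df.withColumnRenamed("a.b", "a_b")

val indexed = new StringIndexer()
  .setInputCol("a_b")
  .setOutputCol("a_b_idx")
  .fit(renamed)
  .transform(renamed)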
16 votes, 3 answers

Spark, Scala, DataFrame: create feature vectors

I have a DataFrame that looks as follows:

userID, category, frequency
1,cat1,1
1,cat2,3
1,cat9,5
2,cat4,6
2,cat9,2
2,cat10,1
3,cat1,5
3,cat7,16
3,cat8,2

The number of distinct categories is 10, and I would like to create a feature vector for each…
asked by Rami
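One common route is to pivot on the category and then assemble the resulting columns into a vector. A hedged sketch (pivot is available from Spark 1.6), using toy rows mirroring the excerpt:

import org.apache.spark.sql.SparkSession
import org.apache.spark.ml.feature.VectorAssembler

val spark = SparkSession.builder().master("local[*]").appName("feature-vectors").getOrCreate()
import spark.implicits._

val df = Seq(
  (1, "cat1", 1), (1, "cat2", 3), (1, "cat9", 5),
  (2, "cat4", 6), (2, "cat9", 2), (2, "cat10", 1),
  (3, "cat1", 5), (3, "cat7", 16), (3, "cat8", 2)
).toDF("userID", "category", "frequency")

// One column per category, with 0 where the user has no entry for it.
val pivoted = df.groupBy("userID").pivot("category").sum("frequency").na.fill(0)

// Pack every column except userID into a single feature vector.
val featureCols = pivoted.columns.filter(_ != "userID")
val assembled = new VectorAssembler()
  .setInputCols(featureCols)
  .setOutputCol("features")
  .transform(pivoted)

assembled.show(truncate = false)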
16 votes, 1 answer

Preserve index-string correspondence spark string indexer

Spark's StringIndexer is quite useful, but it's common to need to retrieve the correspondences between the generated index values and the original strings, and it seems like there should be a built-in way to accomplish this. I'll illustrate using…
asked by moustachio
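For reference, the fitted StringIndexerModel exposes this mapping directly via labels, and IndexToString reverses it. A short sketch with toy data:

import org.apache.spark.sql.SparkSession
import org.apache.spark.ml.feature.{StringIndexer, IndexToString}

val spark = SparkSession.builder().master("local[*]").appName("indexer-labels").getOrCreate()
import spark.implicits._

val df = Seq("a", "b", "c", "a", "a").toDF("category")

val model = new StringIndexer()
  .setInputCol("category")
  .setOutputCol("categoryIndex")
  .fit(df)

// labels(i) is the original string assigned to index i (ordered by frequency by default).
model.labels.zipWithIndex.foreach { case (label, idx) => println(s"$idx -> $label") }

// IndexToString maps the index column back to the original strings.
val restored = new IndexToString()
  .setInputCol("categoryIndex")
  .setOutputCol("categoryRestored")
  .transform(model.transform(df))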
16 votes, 1 answer

Why doesn't spark.ml implement any of spark.mllib's algorithms?

Following the Spark MLlib Guide we can read that Spark has two machine learning libraries: spark.mllib, built on top of RDDs, and spark.ml, built on top of DataFrames. According to this and this question on StackOverflow, DataFrames are better (and…
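To make the split concrete, here is a hedged side-by-side sketch of the same logistic regression in both APIs, on made-up data: spark.mllib consumes an RDD[LabeledPoint], spark.ml consumes a DataFrame.

import org.apache.spark.sql.SparkSession
import org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.linalg.{Vectors => OldVectors}
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.linalg.Vectors

val spark = SparkSession.builder().master("local[*]").appName("ml-vs-mllib").getOrCreate()

// spark.mllib: the RDD-based API.
val rdd = spark.sparkContext.parallelize(Seq(
  LabeledPoint(1.0, OldVectors.dense(0.0, 1.1)),
  LabeledPoint(0.0, OldVectors.dense(2.0, 1.0))
))
val mllibModel = new LogisticRegressionWithLBFGS().run(rdd)

// spark.ml: the DataFrame-based API.
val df = spark.createDataFrame(Seq(
  (1.0, Vectors.dense(0.0, 1.1)),
  (0.0, Vectors.dense(2.0, 1.0))
)).toDF("label", "features")
val mlModel = new LogisticRegression().fit(df)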
16 votes, 1 answer

Is it possible to access estimator attributes in spark.ml pipelines?

I have a spark.ml pipeline in Spark 1.5.1 which consists of a series of transformers followed by a k-means estimator. I want to be able to access the KMeansModel.clusterCenters after fitting the pipeline, but can't figure out how. Is there a…
asked by hilarious
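A fitted PipelineModel keeps its fitted stages in the stages array, so the KMeansModel can be pulled out with a cast. A minimal sketch, assuming k-means is the last stage and using toy data:

import org.apache.spark.sql.SparkSession
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.clustering.{KMeans, KMeansModel}
import org.apache.spark.ml.feature.VectorAssembler

val spark = SparkSession.builder().master("local[*]").appName("pipeline-kmeans").getOrCreate()
import spark.implicits._

val df = Seq((0.0, 0.1), (0.2, 0.0), (9.0, 9.1), (9.2, 8.9)).toDF("x", "y")

val assembler = new VectorAssembler().setInputCols(Array("x", "y")).setOutputCol("features")
val kmeans = new KMeans().setK(2).setSeed(1L)

val pipelineModel = new Pipeline().setStages(Array(assembler, kmeans)).fit(df)

// The fitted estimator sits in the corresponding slot of stages.
val kmModel = pipelineModel.stages.last.asInstanceOf[KMeansModel]
kmModel.clusterCenters.foreach(println)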
15 votes, 1 answer

Spark ML VectorAssembler returns strange output

I am experiencing a very strange behaviour from VectorAssembler and I was wondering if anyone else has seen this. My scenario is pretty straightforward. I parse data from a CSV file where I have some standard Int and Double fields and I also…
asked by Dimitris
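One behaviour that often looks strange here is that VectorAssembler may emit a sparse representation such as (3,[2],[3.0]) whenever that is more compact than the dense form; the values are the same either way. A hedged sketch that densifies the output purely for inspection, on toy data:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.udf
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.ml.linalg.Vector

val spark = SparkSession.builder().master("local[*]").appName("assembler-output").getOrCreate()
import spark.implicits._

val df = Seq((0, 0.0, 0.0, 3.0), (1, 1.0, 2.0, 0.0)).toDF("id", "a", "b", "c")

val assembled = new VectorAssembler()
  .setInputCols(Array("a", "b", "c"))
  .setOutputCol("features")
  .transform(df)

// Force a dense representation purely for readability; the underlying values are unchanged.
val toDense = udf { v: Vector => v.toDense }
assembled.withColumn("features_dense", toDense($"features")).show(truncate = false)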
15 votes, 2 answers

Should we parallelize a DataFrame like we parallelize a Seq before training

Consider the code given here, https://spark.apache.org/docs/1.2.0/ml-guide.html

import org.apache.spark.ml.classification.LogisticRegression
val training = sparkContext.parallelize(Seq(
  LabeledPoint(1.0, Vectors.dense(0.0, 1.1, 0.1)),
  …
asked by Abhishek
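With the DataFrame-based API there is no separate parallelize step: createDataFrame already distributes a local Seq, and the estimator takes the DataFrame directly. A hedged sketch in the Spark 2.x style of that guide example:

import org.apache.spark.sql.SparkSession
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.linalg.Vectors

val spark = SparkSession.builder().master("local[*]").appName("df-training").getOrCreate()

// createDataFrame already distributes the local Seq; a DataFrame is itself a
// distributed collection, so no extra parallelize call is needed before training.
val training = spark.createDataFrame(Seq(
  (1.0, Vectors.dense(0.0, 1.1, 0.1)),
  (0.0, Vectors.dense(2.0, 1.0, -1.0)),
  (0.0, Vectors.dense(2.0, 1.3, 1.0)),
  (1.0, Vectors.dense(0.0, 1.2, -0.5))
)).toDF("label", "features")

val model = new LogisticRegression().setMaxIter(10).fit(training)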
15 votes, 4 answers

Tuning parameters for implicit pyspark.ml ALS matrix factorization model through pyspark.ml CrossValidator

I'm trying to tune the parameters of an ALS matrix factorization model that uses implicit data. For this, I'm trying to use pyspark.ml.tuning.CrossValidator to run through a parameter grid and select the best model. I believe my problem is in the…
asked by ilyab
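For orientation, a rough Scala sketch of that setup: implicit-feedback ALS inside a CrossValidator over a small grid. The column names, grid values and the RMSE evaluator are illustrative assumptions, and RMSE on raw implicit counts is only a crude proxy for ranking quality.

import org.apache.spark.sql.SparkSession
import org.apache.spark.ml.recommendation.ALS
import org.apache.spark.ml.evaluation.RegressionEvaluator
import org.apache.spark.ml.tuning.{CrossValidator, ParamGridBuilder}

val spark = SparkSession.builder().master("local[*]").appName("implicit-als-cv").getOrCreate()
import spark.implicits._

// Toy implicit-feedback data: (user, item, clicks). Real data would be much larger.
val ratings = Seq(
  (0, 0, 4.0), (0, 1, 1.0), (1, 1, 2.0),
  (1, 2, 5.0), (2, 0, 3.0), (2, 2, 1.0)
).toDF("user", "item", "clicks")

val als = new ALS()
  .setImplicitPrefs(true)
  .setUserCol("user")
  .setItemCol("item")
  .setRatingCol("clicks")
  .setColdStartStrategy("drop") // Spark 2.2+: drop NaN predictions for unseen users/items

val grid = new ParamGridBuilder()
  .addGrid(als.rank, Array(5, 10))
  .addGrid(als.regParam, Array(0.01, 0.1))
  .addGrid(als.alpha, Array(1.0, 40.0))
  .build()

// RMSE against the raw counts keeps the cross-validation loop runnable, nothing more.
val evaluator = new RegressionEvaluator()
  .setMetricName("rmse")
  .setLabelCol("clicks")
  .setPredictionCol("prediction")

val cvModel = new CrossValidator()
  .setEstimator(als)
  .setEvaluator(evaluator)
  .setEstimatorParamMaps(grid)
  .setNumFolds(2)
  .fit(ratings)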
15 votes, 3 answers

How to save models from ML Pipeline to S3 or HDFS?

I am trying to save thousands of models produced by ML Pipeline. As indicated in the answer here, the models can be saved as follows:

import java.io._
def saveModel(name: String, model: PipelineModel) = {
  val oos = new ObjectOutputStream(new…
asked by SH Y.
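In more recent Spark versions (Pipeline persistence arrived around 1.6 and was broadened in 2.0), a fitted PipelineModel can be written straight to an HDFS or S3 path without Java serialization. A hedged sketch; the paths are placeholders:

import org.apache.spark.ml.PipelineModel

// Persist a fitted pipeline to a distributed filesystem path; the path is a placeholder.
def saveModel(path: String, model: PipelineModel): Unit =
  model.write.overwrite().save(path)

// Usage (hypothetical): saveModel("hdfs:///models/model-0001", fittedModel)
// Reload later with: val reloaded = PipelineModel.load("hdfs:///models/model-0001")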
14 votes, 5 answers

Pyspark ERROR:py4j.java_gateway:An error occurred while trying to connect to the Java server (127.0.0.1:50532)

Hello, I was working with PySpark, implementing a sentiment analysis project using the ML package for the first time. The code was working fine but suddenly it started showing the error mentioned above: ERROR:py4j.java_gateway:An error occurred while…
asked by jowwel93
14 votes, 1 answer

Create labeledPoints from Spark DataFrame in Python

What .map() function in Python do I use to create a set of LabeledPoints from a Spark DataFrame? What is the notation if the label/outcome is not the first column, but I can refer to it by its column name, 'status'? I create the Python DataFrame with this…
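The question asks for Python, but the mapping is the same idea in any language: pull the label out of each Row by column name and pack the remaining columns into a vector. A Scala sketch; every column name other than 'status' is invented:

import org.apache.spark.sql.SparkSession
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.linalg.Vectors

val spark = SparkSession.builder().master("local[*]").appName("labeled-points").getOrCreate()
import spark.implicits._

// Toy DataFrame where the label column ("status") is not the first column.
val df = Seq((0.5, 1.2, 1.0), (0.1, 3.4, 0.0)).toDF("f1", "f2", "status")

// Refer to the label by name and use the remaining columns as features.
val featureCols = df.columns.filter(_ != "status")
val labeled = df.rdd.map { row =>
  val label = row.getAs[Double]("status")
  val features = Vectors.dense(featureCols.map(c => row.getAs[Double](c)))
  LabeledPoint(label, features)
}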
13 votes, 3 answers

How do I convert an RDD with a SparseVector Column to a DataFrame with a column as Vector

I have an RDD of (String, SparseVector) tuples and I want to create a DataFrame from it, to get a (label: string, features: vector) DataFrame, which is the schema required by most of the ml algorithm libraries. I know it can be…
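One hedged sketch in Scala: with the SQL implicits in scope, an RDD of (String, Vector) pairs converts with toDF, because ml vectors carry a registered UDT. The data is made up:

import org.apache.spark.sql.SparkSession
import org.apache.spark.ml.linalg.Vectors

val spark = SparkSession.builder().master("local[*]").appName("rdd-to-df").getOrCreate()
import spark.implicits._

// Toy RDD of (label, sparse feature vector) pairs.
val rdd = spark.sparkContext.parallelize(Seq(
  ("a", Vectors.sparse(4, Array(0, 3), Array(1.0, 2.0))),
  ("b", Vectors.sparse(4, Array(1), Array(5.0)))
))

// Vector has a registered UDT, so the tuple RDD converts directly into the expected schema.
val df = rdd.toDF("label", "features")
df.printSchema()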