Questions tagged [apache-spark-ml]

Spark ML is a high-level API for building machine learning pipelines in Apache Spark.

925 questions
17 votes, 1 answer

How to get word details from TF Vector RDD in Spark ML Lib?

I have created term frequencies using HashingTF in Spark and got the term frequencies using tf.transform for each word, but the results are displayed in this format: [,…
asked by Srini
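HashingTF is a one-way hash, so the original term behind a given vector index cannot be recovered from the TF vector alone. A minimal sketch of the usual workaround, swapping in CountVectorizer (which keeps its vocabulary); the column names and toy data below are assumptions, not from the question:

import org.apache.spark.sql.SparkSession
import org.apache.spark.ml.feature.CountVectorizer

val spark = SparkSession.builder().master("local[*]").appName("tf-terms").getOrCreate()

// Toy tokenized documents; in practice these would come from a Tokenizer.
val docs = spark.createDataFrame(Seq(
  (0, Seq("spark", "ml", "spark")),
  (1, Seq("hashing", "tf", "vector"))
)).toDF("id", "words")

// CountVectorizer keeps a vocabulary, so index i of the output vector
// corresponds to cvModel.vocabulary(i).
val cvModel = new CountVectorizer()
  .setInputCol("words")
  .setOutputCol("features")
  .fit(docs)

cvModel.vocabulary.zipWithIndex.foreach { case (term, idx) =>
  println(s"index $idx -> term '$term'")
}

With the RDD-based mllib HashingTF, hashingTF.indexOf(term) at least tells you which index a known term hashes to, but there is no reverse lookup.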
16 votes, 1 answer

Field "features" does not exist. SparkML

I am trying to build a model in Spark ML with Zeppelin. I am new to this area and would like some help. I think I need to set the correct datatypes for the columns and set the first column as the label. Any help would be greatly appreciated, thank…
asked by Young4844
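This error usually means the estimator was handed a DataFrame with no vector column named features. A hedged sketch of the usual fix, putting a VectorAssembler in front of the estimator; the column names f1, f2 and target are invented for illustration:

import org.apache.spark.sql.SparkSession
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.ml.classification.LogisticRegression

val spark = SparkSession.builder().master("local[*]").appName("features-col").getOrCreate()

// Toy data: two numeric feature columns and a target column.
val raw = spark.createDataFrame(Seq(
  (1.0, 0.5, 1.0),
  (0.2, 1.5, 0.0)
)).toDF("f1", "f2", "target")

// Assemble the raw numeric columns into the "features" vector the estimator expects.
val assembler = new VectorAssembler()
  .setInputCols(Array("f1", "f2"))
  .setOutputCol("features")

// Point the estimator at the label column explicitly instead of relying on a column named "label".
val lr = new LogisticRegression().setLabelCol("target")

val model = new Pipeline().setStages(Array(assembler, lr)).fit(raw)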
16 votes, 1 answer

How to convert ArrayType to DenseVector in PySpark DataFrame?

I'm getting the following error trying to build a ML Pipeline: pyspark.sql.utils.IllegalArgumentException: 'requirement failed: Column features must be of type org.apache.spark.ml.linalg.VectorUDT@3bfc3ba7 but was actually…
asked by Evan Zamir
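A common fix is to wrap the array column in an ml Vector with a small UDF before it reaches the pipeline. The question is PySpark, but the idea is the same; a Scala sketch, where the column name features_array is an assumption:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.udf
import org.apache.spark.ml.linalg.{Vector, Vectors}

val spark = SparkSession.builder().master("local[*]").appName("array-to-vector").getOrCreate()
import spark.implicits._

// Toy DataFrame with an ArrayType(DoubleType) column.
val df = Seq(
  (0, Seq(1.0, 2.0, 3.0)),
  (1, Seq(4.0, 5.0, 6.0))
).toDF("id", "features_array")

// Wrap the array in a DenseVector so the column becomes the VectorUDT the pipeline expects.
val toVector = udf { xs: Seq[Double] => Vectors.dense(xs.toArray): Vector }

val withFeatures = df.withColumn("features", toVector($"features_array"))
withFeatures.printSchema()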
16 votes, 1 answer

Spark ML indexer cannot resolve DataFrame column name with dots?

I have a DataFrame with a column named a.b. When I specify a.b as the input column name to a StringIndexer, an AnalysisException is thrown with the message "cannot resolve 'a.b' given input columns a.b". I'm using Spark 1.6.0. I'm aware that older versions of…
asked by Joshua Taylor
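A commonly suggested workaround on that version is to rename the column so the dot never reaches the analyzer, then index the renamed column. A small sketch with toy data:

import org.apache.spark.sql.SparkSession
import org.apache.spark.ml.feature.StringIndexer

val spark = SparkSession.builder().master("local[*]").appName("dotted-columns").getOrCreate()
import spark.implicits._

// Toy DataFrame whose column name contains a dot.
val df = Seq("x", "y", "x").toDF("a.b")

// Rename "a.b" to a dot-free name before indexing, then index as usual.
val renamed = df.withColumnRenamed("a.b", "a_b")

val indexed = new StringIndexer()
  .setInputCol("a_b")
  .setOutputCol("a_b_idx")
  .fit(renamed)
  .transform(renamed)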
16 votes, 3 answers

Spark, Scala, DataFrame: create feature vectors

I have a DataFrame that looks as follows:

userID, category, frequency
1,cat1,1
1,cat2,3
1,cat9,5
2,cat4,6
2,cat9,2
2,cat10,1
3,cat1,5
3,cat7,16
3,cat8,2

The number of distinct categories is 10, and I would like to create a feature vector for each…
asked by Rami
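One common route is to pivot on the category and then assemble the resulting columns into a vector. A hedged sketch (pivot is available from Spark 1.6), using toy rows mirroring the excerpt:

import org.apache.spark.sql.SparkSession
import org.apache.spark.ml.feature.VectorAssembler

val spark = SparkSession.builder().master("local[*]").appName("feature-vectors").getOrCreate()
import spark.implicits._

val df = Seq(
  (1, "cat1", 1), (1, "cat2", 3), (1, "cat9", 5),
  (2, "cat4", 6), (2, "cat9", 2), (2, "cat10", 1),
  (3, "cat1", 5), (3, "cat7", 16), (3, "cat8", 2)
).toDF("userID", "category", "frequency")

// One column per category, with 0 where the user has no entry for it.
val pivoted = df.groupBy("userID").pivot("category").sum("frequency").na.fill(0)

// Pack every column except userID into a single feature vector.
val featureCols = pivoted.columns.filter(_ != "userID")
val assembled = new VectorAssembler()
  .setInputCols(featureCols)
  .setOutputCol("features")
  .transform(pivoted)

assembled.show(truncate = false)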
16 votes, 1 answer

Preserve index-string correspondence spark string indexer

Spark's StringIndexer is quite useful, but it's common to need to retrieve the correspondences between the generated index values and the original strings, and it seems like there should be a built-in way to accomplish this. I'll illustrate using…
asked by moustachio
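For reference, the fitted StringIndexerModel exposes this mapping directly via labels, and IndexToString reverses it. A short sketch with toy data:

import org.apache.spark.sql.SparkSession
import org.apache.spark.ml.feature.{StringIndexer, IndexToString}

val spark = SparkSession.builder().master("local[*]").appName("indexer-labels").getOrCreate()
import spark.implicits._

val df = Seq("a", "b", "c", "a", "a").toDF("category")

val model = new StringIndexer()
  .setInputCol("category")
  .setOutputCol("categoryIndex")
  .fit(df)

// labels(i) is the original string assigned to index i (ordered by frequency by default).
model.labels.zipWithIndex.foreach { case (label, idx) => println(s"$idx -> $label") }

// IndexToString maps the index column back to the original strings.
val restored = new IndexToString()
  .setInputCol("categoryIndex")
  .setOutputCol("categoryRestored")
  .transform(model.transform(df))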
16 votes, 1 answer

Why doesn't spark.ml implement any of spark.mllib's algorithms?

Following the Spark MLlib Guide we can read that Spark has two machine learning libraries: spark.mllib, built on top of RDDs, and spark.ml, built on top of DataFrames. According to this and this question on StackOverflow, DataFrames are better (and…
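To make the split concrete, here is a hedged side-by-side sketch of the same logistic regression in both APIs, on made-up data: spark.mllib consumes an RDD[LabeledPoint], spark.ml consumes a DataFrame.

import org.apache.spark.sql.SparkSession
import org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.linalg.{Vectors => OldVectors}
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.linalg.Vectors

val spark = SparkSession.builder().master("local[*]").appName("ml-vs-mllib").getOrCreate()

// spark.mllib: the RDD-based API.
val rdd = spark.sparkContext.parallelize(Seq(
  LabeledPoint(1.0, OldVectors.dense(0.0, 1.1)),
  LabeledPoint(0.0, OldVectors.dense(2.0, 1.0))
))
val mllibModel = new LogisticRegressionWithLBFGS().run(rdd)

// spark.ml: the DataFrame-based API.
val df = spark.createDataFrame(Seq(
  (1.0, Vectors.dense(0.0, 1.1)),
  (0.0, Vectors.dense(2.0, 1.0))
)).toDF("label", "features")
val mlModel = new LogisticRegression().fit(df)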
16 votes, 1 answer

Is it possible to access estimator attributes in spark.ml pipelines?

I have a spark.ml pipeline in Spark 1.5.1 which consists of a series of transformers followed by a k-means estimator. I want to be able to access the KMeansModel.clusterCenters after fitting the pipeline, but can't figure out how. Is there a…
asked by hilarious
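A fitted PipelineModel keeps its fitted stages in the stages array, so the KMeansModel can be pulled out with a cast. A minimal sketch, assuming k-means is the last stage and using toy data:

import org.apache.spark.sql.SparkSession
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.clustering.{KMeans, KMeansModel}
import org.apache.spark.ml.feature.VectorAssembler

val spark = SparkSession.builder().master("local[*]").appName("pipeline-kmeans").getOrCreate()
import spark.implicits._

val df = Seq((0.0, 0.1), (0.2, 0.0), (9.0, 9.1), (9.2, 8.9)).toDF("x", "y")

val assembler = new VectorAssembler().setInputCols(Array("x", "y")).setOutputCol("features")
val kmeans = new KMeans().setK(2).setSeed(1L)

val pipelineModel = new Pipeline().setStages(Array(assembler, kmeans)).fit(df)

// The fitted estimator sits in the corresponding slot of stages.
val kmModel = pipelineModel.stages.last.asInstanceOf[KMeansModel]
kmModel.clusterCenters.foreach(println)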
15 votes, 1 answer

Spark ML VectorAssembler returns strange output

I am experiencing a very strange behaviour from VectorAssembler and I was wondering if anyone else has seen this. My scenario is pretty straightforward. I parse data from a CSV file where I have some standard Int and Double fields and I also…
asked by Dimitris
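One behaviour that often looks strange here is that VectorAssembler may emit a sparse representation such as (3,[2],[3.0]) whenever that is more compact than the dense form; the values are the same either way. A hedged sketch that densifies the output purely for inspection, on toy data:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.udf
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.ml.linalg.Vector

val spark = SparkSession.builder().master("local[*]").appName("assembler-output").getOrCreate()
import spark.implicits._

val df = Seq((0, 0.0, 0.0, 3.0), (1, 1.0, 2.0, 0.0)).toDF("id", "a", "b", "c")

val assembled = new VectorAssembler()
  .setInputCols(Array("a", "b", "c"))
  .setOutputCol("features")
  .transform(df)

// Force a dense representation purely for readability; the underlying values are unchanged.
val toDense = udf { v: Vector => v.toDense }
assembled.withColumn("features_dense", toDense($"features")).show(truncate = false)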
15 votes, 2 answers

Should we parallelize a DataFrame like we parallelize a Seq before training

Consider the code given here, https://spark.apache.org/docs/1.2.0/ml-guide.html

import org.apache.spark.ml.classification.LogisticRegression
val training = sparkContext.parallelize(Seq(
  LabeledPoint(1.0, Vectors.dense(0.0, 1.1, 0.1)),
  …
asked by Abhishek
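With the DataFrame-based API there is no separate parallelize step: createDataFrame already distributes a local Seq, and the estimator takes the DataFrame directly. A hedged sketch in the Spark 2.x style of that guide example:

import org.apache.spark.sql.SparkSession
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.linalg.Vectors

val spark = SparkSession.builder().master("local[*]").appName("df-training").getOrCreate()

// createDataFrame already distributes the local Seq; a DataFrame is itself a
// distributed collection, so no extra parallelize call is needed before training.
val training = spark.createDataFrame(Seq(
  (1.0, Vectors.dense(0.0, 1.1, 0.1)),
  (0.0, Vectors.dense(2.0, 1.0, -1.0)),
  (0.0, Vectors.dense(2.0, 1.3, 1.0)),
  (1.0, Vectors.dense(0.0, 1.2, -0.5))
)).toDF("label", "features")

val model = new LogisticRegression().setMaxIter(10).fit(training)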
15 votes, 4 answers

Tuning parameters for implicit pyspark.ml ALS matrix factorization model through pyspark.ml CrossValidator

I'm trying to tune the parameters of an ALS matrix factorization model that uses implicit data. For this, I'm trying to use pyspark.ml.tuning.CrossValidator to run through a parameter grid and select the best model. I believe my problem is in the…
asked by ilyab
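For orientation, a rough Scala sketch of that setup: implicit-feedback ALS inside a CrossValidator over a small grid. The column names, grid values and the RMSE evaluator are illustrative assumptions, and RMSE on raw implicit counts is only a crude proxy for ranking quality.

import org.apache.spark.sql.SparkSession
import org.apache.spark.ml.recommendation.ALS
import org.apache.spark.ml.evaluation.RegressionEvaluator
import org.apache.spark.ml.tuning.{CrossValidator, ParamGridBuilder}

val spark = SparkSession.builder().master("local[*]").appName("implicit-als-cv").getOrCreate()
import spark.implicits._

// Toy implicit-feedback data: (user, item, clicks). Real data would be much larger.
val ratings = Seq(
  (0, 0, 4.0), (0, 1, 1.0), (1, 1, 2.0),
  (1, 2, 5.0), (2, 0, 3.0), (2, 2, 1.0)
).toDF("user", "item", "clicks")

val als = new ALS()
  .setImplicitPrefs(true)
  .setUserCol("user")
  .setItemCol("item")
  .setRatingCol("clicks")
  .setColdStartStrategy("drop") // Spark 2.2+: drop NaN predictions for unseen users/items

val grid = new ParamGridBuilder()
  .addGrid(als.rank, Array(5, 10))
  .addGrid(als.regParam, Array(0.01, 0.1))
  .addGrid(als.alpha, Array(1.0, 40.0))
  .build()

// RMSE against the raw counts keeps the cross-validation loop runnable, nothing more.
val evaluator = new RegressionEvaluator()
  .setMetricName("rmse")
  .setLabelCol("clicks")
  .setPredictionCol("prediction")

val cvModel = new CrossValidator()
  .setEstimator(als)
  .setEvaluator(evaluator)
  .setEstimatorParamMaps(grid)
  .setNumFolds(2)
  .fit(ratings)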
15 votes, 3 answers

How to save models from ML Pipeline to S3 or HDFS?

I am trying to save thousands of models produced by ML Pipeline. As indicated in the answer here, the models can be saved as follows:

import java.io._
def saveModel(name: String, model: PipelineModel) = {
  val oos = new ObjectOutputStream(new…
asked by SH Y.
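In more recent Spark versions (Pipeline persistence arrived around 1.6 and was broadened in 2.0), a fitted PipelineModel can be written straight to an HDFS or S3 path without Java serialization. A hedged sketch; the paths are placeholders:

import org.apache.spark.ml.PipelineModel

// Persist a fitted pipeline to a distributed filesystem path; the path is a placeholder.
def saveModel(path: String, model: PipelineModel): Unit =
  model.write.overwrite().save(path)

// Usage (hypothetical): saveModel("hdfs:///models/model-0001", fittedModel)
// Reload later with: val reloaded = PipelineModel.load("hdfs:///models/model-0001")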
14 votes, 5 answers

Pyspark ERROR:py4j.java_gateway:An error occurred while trying to connect to the Java server (127.0.0.1:50532)

Hello, I was working with PySpark, implementing a sentiment analysis project using the ML package for the first time. The code was working fine but suddenly it started showing the error mentioned above: ERROR:py4j.java_gateway:An error occurred while…
asked by jowwel93
14 votes, 1 answer

Create labeledPoints from Spark DataFrame in Python

What .map() function in Python do I use to create a set of LabeledPoints from a Spark DataFrame? What is the notation if the label/outcome is not the first column, but I can refer to it by its column name, 'status'? I create the Python DataFrame with this…
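The question asks for Python, but the mapping is the same idea in any language: pull the label out of each Row by column name and pack the remaining columns into a vector. A Scala sketch; every column name other than 'status' is invented:

import org.apache.spark.sql.SparkSession
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.linalg.Vectors

val spark = SparkSession.builder().master("local[*]").appName("labeled-points").getOrCreate()
import spark.implicits._

// Toy DataFrame where the label column ("status") is not the first column.
val df = Seq((0.5, 1.2, 1.0), (0.1, 3.4, 0.0)).toDF("f1", "f2", "status")

// Refer to the label by name and use the remaining columns as features.
val featureCols = df.columns.filter(_ != "status")
val labeled = df.rdd.map { row =>
  val label = row.getAs[Double]("status")
  val features = Vectors.dense(featureCols.map(c => row.getAs[Double](c)))
  LabeledPoint(label, features)
}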
13 votes, 3 answers

How do I convert an RDD with a SparseVector Column to a DataFrame with a column as Vector

I have an RDD of (String, SparseVector) tuples and I want to create a DataFrame from it, to get a (label: string, features: vector) DataFrame, which is the schema required by most of the ml algorithm libraries. I know it can be…
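One hedged sketch in Scala: with the SQL implicits in scope, an RDD of (String, Vector) pairs converts with toDF, because ml vectors carry a registered UDT. The data is made up:

import org.apache.spark.sql.SparkSession
import org.apache.spark.ml.linalg.Vectors

val spark = SparkSession.builder().master("local[*]").appName("rdd-to-df").getOrCreate()
import spark.implicits._

// Toy RDD of (label, sparse feature vector) pairs.
val rdd = spark.sparkContext.parallelize(Seq(
  ("a", Vectors.sparse(4, Array(0, 3), Array(1.0, 2.0))),
  ("b", Vectors.sparse(4, Array(1), Array(5.0)))
))

// Vector has a registered UDT, so the tuple RDD converts directly into the expected schema.
val df = rdd.toDF("label", "features")
df.printSchema()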