Questions tagged [apache-spark-ml]

Spark ML is a high-level API for building machine learning pipelines in Apache Spark.

925 questions
0 votes · 0 answers

Doing arithmetic with Apache Spark's Vector

In order to use Spark's machine learning capabilities I converted my training data to Spark vectors (DenseVector or SparseVector). I have to do some arithmetic (addition, multiplication with scalar, dot product) on that data before I can feed it…
Jonathan · 358
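Spark's DenseVector and SparseVector don't define arithmetic operators themselves; a common workaround is to do the math on the underlying arrays (e.g. via vector.toArray()) and wrap the result back up afterwards. A minimal Spark-free sketch of those operations, with plain Python lists standing in for the arrays:

```python
# Sketch: element-wise arithmetic on lists standing in for DenseVector
# values (in Spark, vector.toArray() would give the underlying array).
def vec_add(a, b):
    return [x + y for x, y in zip(a, b)]

def vec_scale(a, k):
    return [k * x for x in a]

def vec_dot(a, b):
    return sum(x * y for x, y in zip(a, b))

v1 = [1.0, 2.0, 3.0]
v2 = [4.0, 5.0, 6.0]
print(vec_add(v1, v2))   # [5.0, 7.0, 9.0]
print(vec_scale(v1, 2))  # [2.0, 4.0, 6.0]
print(vec_dot(v1, v2))   # 32.0
```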
0 votes · 2 answers

Spark: Use same OneHotEncoder on multiple dataframes

I have two DataFrames with the same columns and I want to convert a categorical column into a vector using one-hot encoding. The problem is that, for example, 3 unique values may occur in the training set while in the test set you may have less than…
ml_0x · 302
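The usual fix is to fit the category-to-index mapping once (on the training data, or on the union of both frames) and reuse that same fitted mapping on each DataFrame, so the one-hot vectors always have the same width. A Spark-free sketch of the idea:

```python
# Sketch (no Spark): fit one category->index mapping on the training data,
# then reuse it on both frames, so the one-hot width is identical.
def fit_categories(values):
    return {v: i for i, v in enumerate(sorted(set(values)))}

def one_hot(value, index):
    vec = [0] * len(index)
    vec[index[value]] = 1
    return vec

train = ["a", "b", "c", "a"]
test = ["a", "c"]          # fewer distinct values, but same vector width
index = fit_categories(train)
print(one_hot("c", index))  # [0, 0, 1]
print(one_hot("a", index))  # [1, 0, 0]
```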
0 votes · 1 answer

How to run pipelines in parallel?

I have built a Scala application on Spark v1.6.0 that combines various functionalities: I have code for scanning a dataframe for certain entries, code that performs certain computations on a dataframe, code for creating an…
navige · 2,447
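Independent stages can be submitted concurrently: Spark's scheduler accepts jobs from multiple threads sharing one SparkContext. A plain-Python sketch of the pattern using a thread pool (scan and compute are hypothetical stand-ins for the question's dataframe functions):

```python
from concurrent.futures import ThreadPoolExecutor

# Sketch: independent stages run concurrently; in Spark, actions submitted
# from separate threads are scheduled as separate jobs on one SparkContext.
def scan(data):
    return [x for x in data if x % 2 == 0]

def compute(data):
    return sum(data)

data = list(range(10))
with ThreadPoolExecutor() as pool:
    f1 = pool.submit(scan, data)
    f2 = pool.submit(compute, data)
print(f1.result())  # [0, 2, 4, 6, 8]
print(f2.result())  # 45
```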
0 votes · 0 answers

NoSuchElementException in ChiSqSelector fit method (version 1.6.0)

I'm running into an error that doesn't make much sense to me, and I couldn't find enough information on the web to resolve it myself. I've written code to generate a list of (String, ArrayBuffer[String]) pairs and then use HashingTF to convert the…
0 votes · 1 answer

How do I convert a Spark dataframe to an RDD and get a bag of words?

I have a dataframe called article with a column processed_title whose rows hold token arrays such as [new, relictual, ...] and [once, upon, a, time...]. I want to flatten it into a bag of words. How could I achieve this…
Krishna Kalyan · 1,672
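Flattening rows of token arrays is what flatMap does on the underlying RDD (e.g. df.rdd.flatMap over the token column). The core operation, sketched without Spark:

```python
# Sketch (plain Python, no Spark): flattening rows of token arrays into a
# bag of words -- the analogue of a flatMap over the processed_title column.
rows = [["new", "relictual"], ["once", "upon", "a", "time"]]
bag = [word for row in rows for word in row]
print(bag)  # ['new', 'relictual', 'once', 'upon', 'a', 'time']
```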
0 votes · 1 answer

How to do binary classification in Spark ML without StringIndexer

I'm trying to use Spark ML's DecisionTreeClassifier in a Pipeline without StringIndexer, because my feature is already indexed as (0.0; 1.0). DecisionTreeClassifier requires double values as labels, so this code should work: def…
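When the label column already holds 0/1 values, it typically only needs to be of double type (e.g. a cast like col("label").cast("double") in a DataFrame) rather than a full StringIndexer pass. The shape of that conversion, sketched in plain Python:

```python
# Sketch: labels already encoded as 0/1 only need a cast to double --
# the analogue of col("label").cast("double") on a DataFrame column.
labels = [0, 1, 1, 0]
as_double = [float(x) for x in labels]
print(as_double)  # [0.0, 1.0, 1.0, 0.0]
```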
0 votes · 1 answer

Spark pipeline combining VectorAssembler and HashingTF transformers

Let's define a Spark pipeline that assembles a few columns together and then applies feature hashing: val df = sqlContext.createDataFrame(Seq((0.0, 1.0, 2.0), (3.0, 4.0, 5.0))).toDF("colx", "coly", "colz") val va = new…
ranlot · 636
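A pipeline of this shape is just function composition: first assemble the columns into one vector, then hash it into a fixed-width feature vector. A rough Spark-free sketch (the hashing function here is illustrative, not Spark's actual hash):

```python
# Sketch (no Spark): two chained transformers, like Pipeline(stages=[va, htf]).
def assemble(row):
    # VectorAssembler analogue: collect named columns into one vector.
    return [row["colx"], row["coly"], row["colz"]]

def hashing_tf(values, num_features=4):
    # HashingTF analogue: bucket each value by hash into a fixed width.
    out = [0.0] * num_features
    for v in values:
        out[hash(v) % num_features] += 1.0
    return out

row = {"colx": 0.0, "coly": 1.0, "colz": 2.0}
print(hashing_tf(assemble(row)))
```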
0 votes · 1 answer

Is it possible to save GBTClassifier in Spark 1.6?

I have trained a GBTClassifier in Spark 1.6 with the Pipeline abstraction and I am somewhat confused about how to save it. If I do: GBTClassificationModel gbt = trainClassifierGBT(data); Model Accuracy = 0.8306451612903226 Test Error =…
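In Spark 1.6 the ML GBTClassificationModel did not yet support the save/load persistence API (tree-ensemble persistence arrived in a later release), so a common workaround was plain object serialization. The general shape, sketched with Python's pickle on a hypothetical parameter map:

```python
import pickle

# Sketch: generic object serialization as a stand-in for model persistence.
# The dict below is a hypothetical parameter map, not a real Spark model.
model = {"numTrees": 20, "maxDepth": 5}
blob = pickle.dumps(model)
restored = pickle.loads(blob)
print(restored == model)  # True
```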
0 votes · 1 answer

PySpark: add a new column with the data frame row number

Hi, I'm trying to build a recommendation system with Spark. I have a data frame with users' emails and movie ratings. df = pd.DataFrame(np.array([["aa@gmail.com",2,3],["aa@gmail.com",5,5],["bb@gmail.com",8,2],["cc@gmail.com",9,3]]),…
Kardu · 865
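One standard approach is zipWithIndex on the underlying RDD and then rebuilding the frame with the index as an extra column. The core step, sketched without Spark:

```python
# Sketch (no Spark): assigning a row number, the analogue of
# rdd.zipWithIndex() followed by rebuilding the DataFrame with the index.
rows = [("aa@gmail.com", 2, 3), ("bb@gmail.com", 8, 2)]
with_index = [(i,) + row for i, row in enumerate(rows)]
print(with_index)  # [(0, 'aa@gmail.com', 2, 3), (1, 'bb@gmail.com', 8, 2)]
```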
0 votes · 2 answers

Failed to load class for data source: Libsvm in Spark ML (pyspark/Scala)

When I try to import a libsvm file in pyspark/scala using "sqlContext.read.format("libsvm").load", I get the following error - "Failed to load class for data source: Libsvm." At the same time, if I use "MLUtils.loadLibSVMFile" it works perfectly…
Mike · 197
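The "libsvm" data source is resolved by name at runtime, so spelling and Spark version both matter, while MLUtils.loadLibSVMFile parses the same text format directly. For reference, the format itself is simple enough to parse by hand, sketched here:

```python
# Sketch: parsing one line of the LibSVM text format by hand --
# "<label> <index>:<value> <index>:<value> ..." (sparse features).
def parse_libsvm_line(line):
    parts = line.split()
    label = float(parts[0])
    features = {int(i): float(v) for i, v in (p.split(":") for p in parts[1:])}
    return label, features

print(parse_libsvm_line("1 3:0.5 7:1.2"))  # (1.0, {3: 0.5, 7: 1.2})
```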
0 votes · 1 answer

Spark Process Dataframe with Random Forest

Using the answer to Spark 1.5.1, MLLib Random Forest Probability, I was able to train a random forest using ml.classification.RandomForestClassifier and process a holdout dataframe with the trained random forest. The problem I have is that I would…
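Conceptually, the probability a random forest reports for a class is (roughly) the fraction of trees voting for that class, which is what the probability column exposes alongside the hard prediction. A toy sketch with hypothetical per-tree votes:

```python
# Sketch: class probability as the fraction of trees voting for each class
# (hypothetical per-tree binary predictions, not real Spark output).
votes = [1, 0, 1, 1, 0]
p1 = sum(votes) / len(votes)
print([1 - p1, p1])  # probabilities for class 0 and class 1
```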
0 votes · 1 answer

Spark ML Pipeline API save not working

In version 1.6 the Pipeline API got a new set of features to save and load pipeline stages. I tried to save a stage to disk after training a classifier, so I could load it again later, reuse it, and save the effort of computing the model again. For some…
Johnny000 · 2,058
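The 1.6 persistence API writes a stage's parameters and data out to a directory and reads them back on load. The overall shape is a parameter round trip like the JSON sketch below (a stand-in for illustration, not Spark's actual on-disk layout):

```python
import json
import os
import tempfile

# Sketch: a save/load round trip for stage parameters -- the shape of what
# Pipeline save/load persists, not Spark's real directory format.
params = {"stage": "LogisticRegression", "maxIter": 10}
path = os.path.join(tempfile.mkdtemp(), "stage.json")
with open(path, "w") as f:
    json.dump(params, f)
with open(path) as f:
    restored = json.load(f)
print(restored == params)  # True
```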
0 votes · 1 answer

Tokenizer in the Spark dataframe API

Each row of a Spark dataframe df contains a tab-separated string in a column rawFV. I already know that splitting on the tab will yield an array of 3 strings for all the rows. This can be verified by: df.map(row =>…
ranlot · 636
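Splitting a tab-separated column is a one-liner whether done in a UDF or in a map over rows; the underlying operation:

```python
# Sketch: splitting a tab-separated string into its fields, the analogue
# of mapping split("\t") over the rawFV column.
raw = "a\tb\tc"
fields = raw.split("\t")
print(fields)       # ['a', 'b', 'c']
print(len(fields))  # 3
```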
0 votes · 1 answer

Spark returns (LogisticRegression) model with scaled coefficients

I am testing LogisticRegression performance on synthetically generated data. The weights I have as input are w = [2, 3, 4] with no intercept and three features. After training on 1000 synthetically generated data points assuming random…
Nikhil J Joshi · 1,177
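A frequent cause of "scaled-looking" coefficients is feature standardization during fitting: if a feature is divided by its standard deviation s before optimization, the coefficient learned in the scaled space relates to the original-space one by w_orig = w_scaled / s. (This is a general observation about standardization, not a diagnosis of this specific question.) Numerically:

```python
# Sketch: undoing standardization on a single coefficient.
# Hypothetical numbers: coefficient learned on a feature scaled by s.
w_scaled, s = 6.0, 2.0
w_orig = w_scaled / s
print(w_orig)  # 3.0
```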
0 votes · 1 answer

Where is predict() in the Logistic Regression of Spark MLlib implemented?

Can somebody point me to the implementation of predict() in LogisticRegressionModel of Spark MLlib? I could find a predictPoint() in the class LogisticRegressionModel, but where is predict()?
Meethu Mathew · 431
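In the MLlib (RDD-based) API, predict() is inherited from the parent class GeneralizedLinearModel, which delegates to the subclass's predictPoint(); that is why it does not appear in LogisticRegressionModel itself. A sketch of that template-method layout (class names here are illustrative stand-ins):

```python
# Sketch of the template-method layout: the parent class defines predict(),
# which delegates to the subclass's predict_point() -- mirroring how
# GeneralizedLinearModel.predict calls LogisticRegressionModel.predictPoint.
class GeneralizedLinearModelSketch:
    def __init__(self, weights, intercept):
        self.weights, self.intercept = weights, intercept

    def predict(self, x):
        return self.predict_point(x, self.weights, self.intercept)

class LogisticModelSketch(GeneralizedLinearModelSketch):
    def predict_point(self, x, w, b):
        margin = sum(wi * xi for wi, xi in zip(w, x)) + b
        return 1.0 if margin > 0 else 0.0

m = LogisticModelSketch([1.0, -1.0], 0.0)
print(m.predict([2.0, 1.0]))  # 1.0
```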