Questions tagged [apache-spark-ml]

Spark ML is a high-level API for building machine learning pipelines in Apache Spark.

925 questions
0 votes · 0 answers

Doing arithmetic with Apache Spark's Vector

In order to use Spark's machine learning capabilities I converted my training data to Spark vectors (DenseVector or SparseVector). I have to do some arithmetic (addition, multiplication with scalar, dot product) on that data before I can feed it…
Jonathan · 358
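Spark's DenseVector and SparseVector don't define arithmetic operators themselves; a common workaround is to do the math on the underlying arrays (e.g. via vector.toArray()) and wrap the result back up afterwards. A minimal Spark-free sketch of those operations, with plain Python lists standing in for the arrays:

```python
# Sketch: element-wise arithmetic on lists standing in for DenseVector
# values (in Spark, vector.toArray() would give the underlying array).
def vec_add(a, b):
    return [x + y for x, y in zip(a, b)]

def vec_scale(a, k):
    return [k * x for x in a]

def vec_dot(a, b):
    return sum(x * y for x, y in zip(a, b))

v1 = [1.0, 2.0, 3.0]
v2 = [4.0, 5.0, 6.0]
print(vec_add(v1, v2))   # [5.0, 7.0, 9.0]
print(vec_scale(v1, 2))  # [2.0, 4.0, 6.0]
print(vec_dot(v1, v2))   # 32.0
```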
0 votes · 2 answers

Spark: Use same OneHotEncoder on multiple dataframes

I have two DataFrames with the same columns and I want to convert a categorical column into a vector using one-hot encoding. The problem is that, for example, 3 unique values may occur in the training set while in the test set you may have less than…
ml_0x · 302
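The usual fix is to fit the category-to-index mapping once (on the training data, or on the union of both frames) and reuse that same fitted mapping on each DataFrame, so the one-hot vectors always have the same width. A Spark-free sketch of the idea:

```python
# Sketch (no Spark): fit one category->index mapping on the training data,
# then reuse it on both frames, so the one-hot width is identical.
def fit_categories(values):
    return {v: i for i, v in enumerate(sorted(set(values)))}

def one_hot(value, index):
    vec = [0] * len(index)
    vec[index[value]] = 1
    return vec

train = ["a", "b", "c", "a"]
test = ["a", "c"]          # fewer distinct values, but same vector width
index = fit_categories(train)
print(one_hot("c", index))  # [0, 0, 1]
print(one_hot("a", index))  # [1, 0, 0]
```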
0 votes · 1 answer

How to run pipelines in parallel?

I have built a Scala application on Spark v1.6.0 that combines various functionalities: I have code for scanning a dataframe for certain entries, code that performs certain computations on a dataframe, code for creating an…
navige · 2,447
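Independent stages can be submitted concurrently: Spark's scheduler accepts jobs from multiple threads sharing one SparkContext. A plain-Python sketch of the pattern using a thread pool (scan and compute are hypothetical stand-ins for the question's dataframe functions):

```python
from concurrent.futures import ThreadPoolExecutor

# Sketch: independent stages run concurrently; in Spark, actions submitted
# from separate threads are scheduled as separate jobs on one SparkContext.
def scan(data):
    return [x for x in data if x % 2 == 0]

def compute(data):
    return sum(data)

data = list(range(10))
with ThreadPoolExecutor() as pool:
    f1 = pool.submit(scan, data)
    f2 = pool.submit(compute, data)
print(f1.result())  # [0, 2, 4, 6, 8]
print(f2.result())  # 45
```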
0 votes · 0 answers

NoSuchElementException in ChiSqSelector fit method (version 1.6.0)

I'm running into an error that doesn't make much sense to me, and I couldn't find enough information on the web to resolve it myself. I've written code to generate a list of (String, ArrayBuffer[String]) pairs and then use HashingTF to convert the…
0 votes · 1 answer

How do I convert a Spark dataframe to an RDD and get a bag of words?

I have a dataframe called article with a column processed_title whose rows hold token arrays such as [new, relictual, ...] and [once, upon, a, time...]. I want to flatten it into a bag of words. How could I achieve this…
Krishna Kalyan · 1,672
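Flattening rows of token arrays is what flatMap does on the underlying RDD (e.g. df.rdd.flatMap over the token column). The core operation, sketched without Spark:

```python
# Sketch (plain Python, no Spark): flattening rows of token arrays into a
# bag of words -- the analogue of a flatMap over the processed_title column.
rows = [["new", "relictual"], ["once", "upon", "a", "time"]]
bag = [word for row in rows for word in row]
print(bag)  # ['new', 'relictual', 'once', 'upon', 'a', 'time']
```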
0 votes · 1 answer

How to do binary classification in Spark ML without StringIndexer

I'm trying to use Spark ML's DecisionTreeClassifier in a Pipeline without StringIndexer, because my feature is already indexed as (0.0; 1.0). DecisionTreeClassifier requires double values as labels, so this code should work: def…
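When the label column already holds 0/1 values, it typically only needs to be of double type (e.g. a cast like col("label").cast("double") in a DataFrame) rather than a full StringIndexer pass. The shape of that conversion, sketched in plain Python:

```python
# Sketch: labels already encoded as 0/1 only need a cast to double --
# the analogue of col("label").cast("double") on a DataFrame column.
labels = [0, 1, 1, 0]
as_double = [float(x) for x in labels]
print(as_double)  # [0.0, 1.0, 1.0, 0.0]
```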
0 votes · 1 answer

Spark pipeline combining VectorAssembler and HashingTF transformers

Let's define a Spark pipeline that assembles a few columns together and then applies feature hashing: val df = sqlContext.createDataFrame(Seq((0.0, 1.0, 2.0), (3.0, 4.0, 5.0))).toDF("colx", "coly", "colz") val va = new…
ranlot · 636
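A pipeline of this shape is just function composition: first assemble the columns into one vector, then hash it into a fixed-width feature vector. A rough Spark-free sketch (the hashing function here is illustrative, not Spark's actual hash):

```python
# Sketch (no Spark): two chained transformers, like Pipeline(stages=[va, htf]).
def assemble(row):
    # VectorAssembler analogue: collect named columns into one vector.
    return [row["colx"], row["coly"], row["colz"]]

def hashing_tf(values, num_features=4):
    # HashingTF analogue: bucket each value by hash into a fixed width.
    out = [0.0] * num_features
    for v in values:
        out[hash(v) % num_features] += 1.0
    return out

row = {"colx": 0.0, "coly": 1.0, "colz": 2.0}
print(hashing_tf(assemble(row)))
```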
0 votes · 1 answer

Is it possible to save GBTClassifier in Spark 1.6?

I have trained a GBTClassifier in Spark 1.6 with the Pipeline abstraction and I am somewhat confused about how to save it. If I do: GBTClassificationModel gbt = trainClassifierGBT(data); Model Accuracy = 0.8306451612903226 Test Error =…
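In Spark 1.6 the ML GBTClassificationModel did not yet support the save/load persistence API (tree-ensemble persistence arrived in a later release), so a common workaround was plain object serialization. The general shape, sketched with Python's pickle on a hypothetical parameter map:

```python
import pickle

# Sketch: generic object serialization as a stand-in for model persistence.
# The dict below is a hypothetical parameter map, not a real Spark model.
model = {"numTrees": 20, "maxDepth": 5}
blob = pickle.dumps(model)
restored = pickle.loads(blob)
print(restored == model)  # True
```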
0 votes · 1 answer

PySpark: add a new column with the data frame row number

Hi, I'm trying to build a recommendation system with Spark. I have a data frame with users' emails and movie ratings. df = pd.DataFrame(np.array([["aa@gmail.com",2,3],["aa@gmail.com",5,5],["bb@gmail.com",8,2],["cc@gmail.com",9,3]]),…
Kardu · 865
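One standard approach is zipWithIndex on the underlying RDD and then rebuilding the frame with the index as an extra column. The core step, sketched without Spark:

```python
# Sketch (no Spark): assigning a row number, the analogue of
# rdd.zipWithIndex() followed by rebuilding the DataFrame with the index.
rows = [("aa@gmail.com", 2, 3), ("bb@gmail.com", 8, 2)]
with_index = [(i,) + row for i, row in enumerate(rows)]
print(with_index)  # [(0, 'aa@gmail.com', 2, 3), (1, 'bb@gmail.com', 8, 2)]
```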
0 votes · 2 answers

Failed to load class for data source: Libsvm in Spark ML (pyspark/Scala)

When I try to import a libsvm file in pyspark/scala using "sqlContext.read.format("libsvm").load", I get the following error - "Failed to load class for data source: Libsvm." At the same time, if I use "MLUtils.loadLibSVMFile" it works perfectly…
Mike · 197
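The "libsvm" data source is resolved by name at runtime, so spelling and Spark version both matter, while MLUtils.loadLibSVMFile parses the same text format directly. For reference, the format itself is simple enough to parse by hand, sketched here:

```python
# Sketch: parsing one line of the LibSVM text format by hand --
# "<label> <index>:<value> <index>:<value> ..." (sparse features).
def parse_libsvm_line(line):
    parts = line.split()
    label = float(parts[0])
    features = {int(i): float(v) for i, v in (p.split(":") for p in parts[1:])}
    return label, features

print(parse_libsvm_line("1 3:0.5 7:1.2"))  # (1.0, {3: 0.5, 7: 1.2})
```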
0 votes · 1 answer

Spark Process Dataframe with Random Forest

Using the answer to Spark 1.5.1, MLLib Random Forest Probability, I was able to train a random forest using ml.classification.RandomForestClassifier and process a holdout dataframe with the trained random forest. The problem I have is that I would…
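Conceptually, the probability a random forest reports for a class is (roughly) the fraction of trees voting for that class, which is what the probability column exposes alongside the hard prediction. A toy sketch with hypothetical per-tree votes:

```python
# Sketch: class probability as the fraction of trees voting for each class
# (hypothetical per-tree binary predictions, not real Spark output).
votes = [1, 0, 1, 1, 0]
p1 = sum(votes) / len(votes)
print([1 - p1, p1])  # probabilities for class 0 and class 1
```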
0 votes · 1 answer

Spark ML Pipeline API save not working

In version 1.6 the Pipeline API got a new set of features to save and load pipeline stages. I tried to save a stage to disk after training a classifier, so I could load it again later, reuse it, and save the effort of computing the model again. For some…
Johnny000 · 2,058
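The 1.6 persistence API writes a stage's parameters and data out to a directory and reads them back on load. The overall shape is a parameter round trip like the JSON sketch below (a stand-in for illustration, not Spark's actual on-disk layout):

```python
import json
import os
import tempfile

# Sketch: a save/load round trip for stage parameters -- the shape of what
# Pipeline save/load persists, not Spark's real directory format.
params = {"stage": "LogisticRegression", "maxIter": 10}
path = os.path.join(tempfile.mkdtemp(), "stage.json")
with open(path, "w") as f:
    json.dump(params, f)
with open(path) as f:
    restored = json.load(f)
print(restored == params)  # True
```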
0 votes · 1 answer

Tokenizer in the Spark dataframe API

Each row of a Spark dataframe df contains a tab-separated string in a column rawFV. I already know that splitting on the tab will yield an array of 3 strings for all the rows. This can be verified by: df.map(row =>…
ranlot · 636
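Splitting a tab-separated column is a one-liner whether done in a UDF or in a map over rows; the underlying operation:

```python
# Sketch: splitting a tab-separated string into its fields, the analogue
# of mapping split("\t") over the rawFV column.
raw = "a\tb\tc"
fields = raw.split("\t")
print(fields)       # ['a', 'b', 'c']
print(len(fields))  # 3
```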
0 votes · 1 answer

Spark returns (LogisticRegression) model with scaled coefficients

I am testing LogisticRegression performance on synthetically generated data. The weights I have as input are w = [2, 3, 4] with no intercept and three features. After training on 1000 synthetically generated data points assuming random…
Nikhil J Joshi · 1,177
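A frequent cause of "scaled-looking" coefficients is feature standardization during fitting: if a feature is divided by its standard deviation s before optimization, the coefficient learned in the scaled space relates to the original-space one by w_orig = w_scaled / s. (This is a general observation about standardization, not a diagnosis of this specific question.) Numerically:

```python
# Sketch: undoing standardization on a single coefficient.
# Hypothetical numbers: coefficient learned on a feature scaled by s.
w_scaled, s = 6.0, 2.0
w_orig = w_scaled / s
print(w_orig)  # 3.0
```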
0 votes · 1 answer

Where is predict() in the Logistic Regression of Spark MLlib implemented?

Can somebody point me to the implementation of predict() in LogisticRegressionModel of Spark MLlib? I could find a predictPoint() in the class LogisticRegressionModel, but where is predict()?
Meethu Mathew · 431
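In the MLlib (RDD-based) API, predict() is inherited from the parent class GeneralizedLinearModel, which delegates to the subclass's predictPoint(); that is why it does not appear in LogisticRegressionModel itself. A sketch of that template-method layout (class names here are illustrative stand-ins):

```python
# Sketch of the template-method layout: the parent class defines predict(),
# which delegates to the subclass's predict_point() -- mirroring how
# GeneralizedLinearModel.predict calls LogisticRegressionModel.predictPoint.
class GeneralizedLinearModelSketch:
    def __init__(self, weights, intercept):
        self.weights, self.intercept = weights, intercept

    def predict(self, x):
        return self.predict_point(x, self.weights, self.intercept)

class LogisticModelSketch(GeneralizedLinearModelSketch):
    def predict_point(self, x, w, b):
        margin = sum(wi * xi for wi, xi in zip(w, x)) + b
        return 1.0 if margin > 0 else 0.0

m = LogisticModelSketch([1.0, -1.0], 0.0)
print(m.predict([2.0, 1.0]))  # 1.0
```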