Questions tagged [apache-spark-mllib]

MLlib is a low-level, RDD-based machine learning library for Apache Spark

2241 questions
9 votes, 2 answers

Polynomial regression in spark/ or external packages for spark

After a good amount of searching on the net for this topic, I am ending up here hoping for some pointers. Please read further. After analyzing Spark 2.0 I concluded polynomial regression is not possible with Spark alone, so is there…
asked by sourabh
9 votes, 1 answer

Non linear (DAG) ML pipelines in Apache Spark

I've set up a simple Spark-ML app, where I have a pipeline of independent transformers that add columns to a dataframe of raw data. Since the transformers don't look at the output of one another, I was hoping I could run them in parallel in a…
asked by hillel
9 votes, 2 answers

How to fix "MetadataFetchFailedException: Missing an output location for shuffle"?

If I increase the model size of my word2vec model I start to get this kind of exception in my log: org.apache.spark.shuffle.MetadataFetchFailedException: Missing an output location for shuffle 6 at…
asked by Stefan Falk
9 votes, 2 answers

How to convert type Row into Vector to feed to the KMeans

When I try to feed df2 to KMeans I get the following error: clusters = KMeans.train(df2, 10, maxIterations=30, runs=10, initializationMode="random") The error I get: Cannot convert type into…
9 votes, 1 answer

How to map variable names to features after pipeline

I have modified the OneHotEncoder example to actually train a LogisticRegression. My question is how to map the generated weights back to the categorical variables? def oneHotEncoderExample(sqlContext: SQLContext): Unit = { val df =…
asked by lapolonio
9 votes, 1 answer

How do I use Spark's Feature Importance on Random Forest?

The documentation for Random Forests does not include feature importances. However, it is listed on the Jira as resolved and is in the source code. HERE also says "The main differences between this API and the original MLlib ensembles API…
asked by Climbs_lika_Spyder
9 votes, 4 answers

PySpark & MLLib: Class Probabilities of Random Forest Predictions

I'm trying to extract the class probabilities of a random forest object I have trained using PySpark. However, I do not see an example of it anywhere in the documentation, nor is it a method of RandomForestModel. How can I extract class…
asked by Bryan
9 votes, 1 answer

Spark MLLib TFIDF implementation for LogisticRegression

I'm trying to use the new TF-IDF algorithm that Spark 1.1.0 offers. I'm writing my job for MLlib in Java, but I can't figure out how to get the TF-IDF implementation working. For some reason IDFModel only accepts a JavaRDD as input for the method transform…
asked by Johnny000
8 votes, 1 answer

Using Jackson 2.9.9 in java Spark

I am trying to use the MLLIB library (java) but one of my dependencies uses Jackson 2.9.9. I noticed that a pull request was made such that the master branch's dependency is upgraded to this particular version. Now I wanted to use this master branch…
asked by Jasper
8 votes, 1 answer

Failed to execute user defined function($anonfun$9: (string) => double) on using String Indexer for multiple columns

I am trying to apply StringIndexer on multiple columns. Here is my code: val stringIndexers = Categorical_Model.map { colName => new StringIndexer().setInputCol(colName).setOutputCol(colName + "_indexed") } var dfStringIndexed =…
asked by Leothorn
8 votes, 2 answers

Are random seeds compatible between systems?

I made a random forest model using Python's sklearn package, where I set the seed to, for example, 1234. To productionise models, we use PySpark. If I were to pass the same hyperparameters and the same seed value, i.e. 1234, would it get the same…
8 votes, 3 answers

Spark Java IllegalArgumentException at org.apache.xbean.asm5.ClassReader

I'm trying to use Spark 2.3.1 with Java. I followed the examples in the documentation but keep getting a poorly described exception when calling .fit(trainingData). Exception in thread "main" java.lang.IllegalArgumentException at…
8 votes, 1 answer

How to set a custom loss function in Spark MLlib

I would like to use my own loss function instead of the squared loss for the linear regression model in Spark MLlib. So far I can't find any part of the documentation that mentions whether it is even possible.
asked by user4658980
8 votes, 3 answers

convert dataframe to libsvm format

I have a dataframe resulting from a SQL query: df1 = sqlContext.sql("select * from table_test") I need to convert this dataframe to libsvm format so that it can be provided as an input for pyspark.ml.classification.LogisticRegression. I tried to do…
asked by sah.stc
8 votes, 2 answers

How to use QuantileDiscretizer across groups in a DataFrame?

I have a DataFrame with the following columns:
scala> show_times.printSchema
root
 |-- account: string (nullable = true)
 |-- channel: string (nullable = true)
 |-- show_name: string (nullable = true)
 |-- total_time_watched: integer (nullable =…