Questions tagged [apache-spark-ml]

Spark ML is the DataFrame-based, high-level API for building machine learning pipelines in Apache Spark.

925 questions
0 votes, 2 answers

Scala: flatMap/filter elements of an array by instance type

I am curious how to filter the elements of an array in Scala by class. case class FooBarGG(foo: Int, bar: String, baz: Option[String]) val df = Seq((1, "first", "A"), (1, "second", "A"), (2, "noValidFormat", "B"), (1, "lastAssumingSameDate",…
Georg Heiler • 16,916 • 36 • 162 • 292
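
A minimal Scala sketch of one way to approach the question above, assuming the goal is to keep only elements of a given runtime type: collect with a type pattern (or an equivalent flatMap) does the class-based filtering.

    // Filter a mixed collection down to one type via a type pattern.
    val mixed: Seq[Any] = Seq(1, "first", 2.0, "second")
    val strings: Seq[String] = mixed.collect { case s: String => s }
    // Equivalent with flatMap:
    val strings2 = mixed.flatMap {
      case s: String => Seq(s)
      case _         => Nil
    }
    // Both yield Seq("first", "second")
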
0 votes, 1 answer

Formatting data for Spark ML

I'm new to Spark and Spark ML. I generated some data with the function KMeansDataGenerator.generateKMeansRDD, but I fail when formatting it so that it can then be used by an ML algorithm (here it's K-Means). The error is Exception in thread…
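
A minimal sketch of the usual shape of the fix, assuming a spark-shell style session (spark and sc in scope): generateKMeansRDD returns RDD[Array[Double]], while ml.clustering.KMeans expects a DataFrame with a vector column, named "features" by default.

    import org.apache.spark.ml.clustering.KMeans
    import org.apache.spark.ml.linalg.Vectors
    import org.apache.spark.mllib.util.KMeansDataGenerator
    import spark.implicits._

    // Generate 1000 points in 3 dimensions around 5 centers (scale 1.0).
    val data = KMeansDataGenerator.generateKMeansRDD(sc, 1000, 5, 3, 1.0)
    // Wrap each Array[Double] in an ml Vector and name the column.
    val df = data.map(a => Tuple1(Vectors.dense(a))).toDF("features")
    val model = new KMeans().setK(5).fit(df)
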
0 votes, 0 answers

Spark ML: verifying prediction probabilities

We have built a text classification solution using Naive Bayes with decent prediction accuracy. In cases where the prediction has failed, we display the prediction probability, and we also manually pull all matching text from the…
lives • 1,243 • 5 • 25 • 61
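
One way to surface the per-class probabilities for inspection, sketched in Scala (the column names are the ml defaults; train and test are assumed DataFrames):

    import org.apache.spark.ml.classification.NaiveBayes

    // transform() adds a "probability" vector column with one entry
    // per class, alongside the hard "prediction".
    val model = new NaiveBayes().fit(train)
    model.transform(test)
      .select("label", "prediction", "probability")
      .show(5, truncate = false)
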
0 votes, 1 answer

How to deal with hundreds of columns of data from a text file when training a model using Spark ML

I have a text file with hundreds of columns, but the columns don't have names. The first column is the label and the others are features. I've read some examples that must specify column names for the training data, but it is quite troublesome…
April • 819 • 2 • 12 • 23
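
A sketch of one way around naming hundreds of columns, assuming a delimited text file with no header row: Spark's CSV reader assigns default names (_c0, _c1, ...), which can be fed to VectorAssembler programmatically. The path here is hypothetical.

    import org.apache.spark.ml.feature.VectorAssembler

    val raw = spark.read
      .option("inferSchema", "true")
      .csv("data.txt")                    // hypothetical path, no header
    // _c0 is the label; everything else becomes the feature vector.
    val assembler = new VectorAssembler()
      .setInputCols(raw.columns.tail)
      .setOutputCol("features")
    val train = assembler.transform(raw)
      .withColumnRenamed("_c0", "label")
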
0 votes, 1 answer

Using VectorAssembler in Spark

I have the following DataFrame (assume it is already a DataFrame): val df = sc.parallelize(Seq((1, 2, 10), (3, 4, 11), (5, 6, 12))) .toDF("a", "b", "c") and I want to combine some (but not all) of the columns into one column and make it an…
Mpizos Dimitris • 4,819 • 12 • 58 • 100
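
A minimal sketch against the question's own DataFrame: VectorAssembler takes an explicit list of input columns, so combining only a subset is just a matter of which names go into setInputCols.

    import org.apache.spark.ml.feature.VectorAssembler

    val df = sc.parallelize(Seq((1, 2, 10), (3, 4, 11), (5, 6, 12)))
      .toDF("a", "b", "c")
    // Combine only "a" and "b"; "c" stays a plain column.
    val assembled = new VectorAssembler()
      .setInputCols(Array("a", "b"))
      .setOutputCol("features")
      .transform(df)
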
0 votes, 1 answer

PySpark: convert RDD[DenseVector] to DataFrame

I have the following RDD: rdd.take(5) gives me: [DenseVector([9.2463, 1.0, 0.392, 0.3381, 162.6437, 7.9432, 8.3397, 11.7699]), DenseVector([9.2463, 1.0, 0.392, 0.3381, 162.6437, 7.9432, 8.3397, 11.7699]), DenseVector([5.0, 20.0, 0.3444, 0.3295,…
Edamame • 23,718 • 73 • 186 • 320
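
The question is PySpark, but the Scala-equivalent sketch below shows the shape of the conversion: wrap each vector in a Tuple1 so toDF can name the single column.

    import org.apache.spark.ml.linalg.Vectors
    import spark.implicits._

    val rdd = sc.parallelize(Seq(
      Vectors.dense(9.2463, 1.0, 0.392),
      Vectors.dense(5.0, 20.0, 0.3444)))
    val df = rdd.map(Tuple1.apply).toDF("features")
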
0 votes, 3 answers

PySpark: creating a k-means clustering model using spark-ml with a Spark DataFrame

I am using the following code to create a clustering model: import pandas as pd pandas_df = pd.read_pickle('df_features.pickle') spark_df = sqlContext.createDataFrame(pandas_df) from pyspark.ml.linalg import Vectors from pyspark.ml.clustering…
Edamame • 23,718 • 73 • 186 • 320
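
A Scala-equivalent sketch (the PySpark API mirrors it; column names here are hypothetical): ml KMeans consumes a single vector column, so assemble the raw feature columns first.

    import org.apache.spark.ml.clustering.KMeans
    import org.apache.spark.ml.feature.VectorAssembler

    val features = new VectorAssembler()
      .setInputCols(Array("f1", "f2", "f3"))  // hypothetical columns
      .setOutputCol("features")
      .transform(spark_df)                    // the converted pandas df
    val model = new KMeans().setK(3).fit(features)
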
0 votes, 1 answer

How to train a Spark ML gradient-boosted-tree classifier given an RDD

Given the following rdd training_rdd = rdd.select( # Categorical features col('device_os'), # 'ios', 'android' # Numeric features col('30day_click_count'), col('30day_impression_count'), …
samol • 18,950 • 32 • 88 • 127
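
A sketch of the usual pipeline shape for this, using the question's column names plus an assumed "label" column: index the categorical feature, assemble a vector, then fit a GBTClassifier on a DataFrame rather than the raw RDD.

    import org.apache.spark.ml.Pipeline
    import org.apache.spark.ml.classification.GBTClassifier
    import org.apache.spark.ml.feature.{StringIndexer, VectorAssembler}

    val indexer = new StringIndexer()
      .setInputCol("device_os")
      .setOutputCol("device_os_idx")
    val assembler = new VectorAssembler()
      .setInputCols(Array("device_os_idx",
        "30day_click_count", "30day_impression_count"))
      .setOutputCol("features")
    val gbt = new GBTClassifier()
      .setLabelCol("label")                 // assumed label column
      .setFeaturesCol("features")
    val model = new Pipeline()
      .setStages(Array(indexer, assembler, gbt))
      .fit(trainingDF)                      // a DataFrame, not the RDD
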
0 votes, 1 answer

How can I use the pyspark.mllib RDD API metrics to evaluate pyspark.ml (the new DataFrame API)?

The old MLlib API has evaluation metric classes: https://spark.apache.org/docs/latest/mllib-evaluation-metrics.html However, the new DataFrame API does NOT have such a class: https://spark.apache.org/docs/latest/ml-guide.html It has the Evaluator…
Hanan Shteingart • 8,480 • 10 • 53 • 66
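
A sketch of the standard bridge: pull (prediction, label) pairs out of the ml output DataFrame as an RDD and hand them to the mllib metrics classes. This assumes both columns are doubles.

    import org.apache.spark.mllib.evaluation.MulticlassMetrics

    // predictions: the DataFrame returned by model.transform(test)
    val predictionAndLabels = predictions
      .select("prediction", "label")
      .rdd
      .map(r => (r.getDouble(0), r.getDouble(1)))
    val metrics = new MulticlassMetrics(predictionAndLabels)
    println(metrics.confusionMatrix)
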
0 votes, 1 answer

Executor heartbeat timed out in Spark on DataProc

I am trying to fit an ML model in Spark (2.0.0) on a Google DataProc cluster. When fitting the model I receive an Executor heartbeat timed out error. How can I resolve this? Other solutions indicate this is probably due to Out of Memory of (one of)…
Stijn • 459 • 2 • 8 • 18
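
The usual first steps are to give executors more memory and more generous timeouts; a sketch of the relevant settings (the values are illustrative, not a recommendation):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .config("spark.executor.memory", "8g")             // more heap per executor
      .config("spark.executor.heartbeatInterval", "60s")
      .config("spark.network.timeout", "600s")           // must exceed the interval
      .getOrCreate()
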
0 votes, 0 answers

Logistic Regression for multiclass classification using PySpark, and issues

I am trying to use Logistic Regression to classify datasets which have a SparseVector as the feature vector. Case 1: I tried using the ML pipeline as follows: # imported library from ML from pyspark.ml.feature import HashingTF from…
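
In Spark 2.0 the ml LogisticRegression is binomial-only (multinomial support arrived in 2.1), so a common multiclass route is OneVsRest; a Scala-equivalent sketch, with train an assumed DataFrame:

    import org.apache.spark.ml.classification.{LogisticRegression, OneVsRest}

    // SparseVector features are fine; "label"/"features" are the defaults.
    val lr = new LogisticRegression().setMaxIter(10)
    val ovr = new OneVsRest().setClassifier(lr)
    val model = ovr.fit(train)
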
0 votes, 1 answer

Can I measure the parallelization performance of the ML API in Spark?

In general, I want to compare the computing time of the same learning algorithm on a large dataset versus on split datasets in Spark. The other reason is that I want to get the partition model results. However, the result shows that the original…
Martin TT • 301 • 2 • 16
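
A crude wall-clock sketch for such a comparison (estimator and df are hypothetical stand-ins): time the same fit on the full data and on a sampled split.

    // Hypothetical helper: time a block and print the elapsed seconds.
    def time[A](label: String)(body: => A): A = {
      val t0 = System.nanoTime()
      val result = body
      println(s"$label took ${(System.nanoTime() - t0) / 1e9} s")
      result
    }

    val fullModel  = time("full data") { estimator.fit(df) }
    val splitModel = time("half data") { estimator.fit(df.sample(false, 0.5)) }
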
0 votes, 1 answer

IllegalArgumentException: u'requirement failed: Invalid initial capacity' in Spark on Google DataProc

I am currently trying to run an ML decision tree on a large dataset (30 million observations, 13 variables) in Spark 2.0.0 on Google DataProc. When I execute: labelIndexer = StringIndexer(inputCol="Target", outputCol="indexedLabel").fit(data) I receive…
Stijn • 459 • 2 • 8 • 18
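
The "Invalid initial capacity" requirement appears to come from Spark's internal OpenHashSet, which rejects a capacity of zero; one cheap sanity check before fitting, sketched below, is that the input actually has rows (whether that is the trigger here is an assumption).

    import org.apache.spark.ml.feature.StringIndexer

    // Sanity check before fitting: non-empty input.
    require(data.count() > 0, "input DataFrame is empty")
    val labelIndexer = new StringIndexer()
      .setInputCol("Target")
      .setOutputCol("indexedLabel")
      .fit(data)
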
0 votes, 1 answer

I am running GBT in Spark ML for CTR prediction and getting an exception because of the maxBins parameter

Exception details : Exception in thread "main" java.lang.IllegalArgumentException: requirement failed: DecisionTree requires maxBins (= 32) to be at least as large as the number of values in each categorical feature, but categorical feature…
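
The message says some categorical feature has more than 32 distinct values, and maxBins must be at least that large; a sketch of the fix (64 is illustrative, pick a value at least your largest category count):

    import org.apache.spark.ml.classification.GBTClassifier

    val gbt = new GBTClassifier()
      .setMaxBins(64)  // must be >= max #categories of any categorical feature
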
0 votes, 1 answer

Trying to apply GBT on a set of data, getting ClassCastException

I am getting "Exception in thread "main" java.lang.ClassCastException: org.apache.spark.ml.attribute.UnresolvedAttribute$ cannot be cast to org.apache.spark.ml.attribute.NominalAttribute". Source code package com.spark.lograthmicregression; import…
cody123 • 2,040 • 24 • 29
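
This particular cast failure usually means the label column carries no nominal metadata; a common fix, sketched below with illustrative column names, is to run the label through StringIndexer so it gains the NominalAttribute the tree code expects.

    import org.apache.spark.ml.feature.StringIndexer

    val indexed = new StringIndexer()
      .setInputCol("label")
      .setOutputCol("indexedLabel")
      .fit(df)
      .transform(df)
    // Then point the GBT at it via setLabelCol("indexedLabel").
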