Questions tagged [apache-spark-ml]

Spark ML is the DataFrame-based, high-level API for building machine learning pipelines in Apache Spark.

925 questions
0 votes, 2 answers

Scala: flatMap/filter elements of an array by instance type

I am curious how to filter the elements of an array in Scala by class. case class FooBarGG(foo: Int, bar: String, baz: Option[String]) val df = Seq((1, "first", "A"), (1, "second", "A"), (2, "noValidFormat", "B"), (1, "lastAssumingSameDate",…
Georg Heiler • 16,916 • 36 • 162 • 292
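
A minimal Scala sketch of one way to approach the question above, assuming the goal is to keep only elements of a given runtime type: collect with a type pattern (or an equivalent flatMap) does the class-based filtering.

    // Filter a mixed collection down to one type via a type pattern.
    val mixed: Seq[Any] = Seq(1, "first", 2.0, "second")
    val strings: Seq[String] = mixed.collect { case s: String => s }
    // Equivalent with flatMap:
    val strings2 = mixed.flatMap {
      case s: String => Seq(s)
      case _         => Nil
    }
    // Both yield Seq("first", "second")
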
0 votes, 1 answer

Formatting data for Spark ML

I'm new to Spark and Spark ML. I generated some data with the function KMeansDataGenerator.generateKMeansRDD, but I fail when formatting it so that it can then be used by an ML algorithm (here it's K-Means). The error is Exception in thread…
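
A minimal sketch of the usual shape of the fix, assuming a spark-shell style session (spark and sc in scope): generateKMeansRDD returns RDD[Array[Double]], while ml.clustering.KMeans expects a DataFrame with a vector column, named "features" by default.

    import org.apache.spark.ml.clustering.KMeans
    import org.apache.spark.ml.linalg.Vectors
    import org.apache.spark.mllib.util.KMeansDataGenerator
    import spark.implicits._

    // Generate 1000 points in 3 dimensions around 5 centers (scale 1.0).
    val data = KMeansDataGenerator.generateKMeansRDD(sc, 1000, 5, 3, 1.0)
    // Wrap each Array[Double] in an ml Vector and name the column.
    val df = data.map(a => Tuple1(Vectors.dense(a))).toDF("features")
    val model = new KMeans().setK(5).fit(df)
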
0 votes, 0 answers

Spark ML: verifying prediction probabilities

We have built a text classification solution using Naive Bayes with decent prediction accuracy. In cases where the prediction has failed, we display the prediction probability, and we also manually pull all matching text from the…
lives • 1,243 • 5 • 25 • 61
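
One way to surface the per-class probabilities for inspection, sketched in Scala (the column names are the ml defaults; train and test are assumed DataFrames):

    import org.apache.spark.ml.classification.NaiveBayes

    // transform() adds a "probability" vector column with one entry
    // per class, alongside the hard "prediction".
    val model = new NaiveBayes().fit(train)
    model.transform(test)
      .select("label", "prediction", "probability")
      .show(5, truncate = false)
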
0 votes, 1 answer

How to deal with hundreds of columns of data from a text file when training a model using Spark ML

I have a text file with hundreds of columns, but the columns don't have names. The first column is the label and the others are features. I've read some examples that must specify column names for the training data, but it is quite troublesome…
April • 819 • 2 • 12 • 23
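
A sketch of one way around naming hundreds of columns, assuming a delimited text file with no header row: Spark's CSV reader assigns default names (_c0, _c1, ...), which can be fed to VectorAssembler programmatically. The path here is hypothetical.

    import org.apache.spark.ml.feature.VectorAssembler

    val raw = spark.read
      .option("inferSchema", "true")
      .csv("data.txt")                    // hypothetical path, no header
    // _c0 is the label; everything else becomes the feature vector.
    val assembler = new VectorAssembler()
      .setInputCols(raw.columns.tail)
      .setOutputCol("features")
    val train = assembler.transform(raw)
      .withColumnRenamed("_c0", "label")
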
0 votes, 1 answer

Using VectorAssembler in Spark

I have the following DataFrame (assume it is already a DataFrame): val df = sc.parallelize(Seq((1, 2, 10), (3, 4, 11), (5, 6, 12))) .toDF("a", "b", "c") and I want to combine some (but not all) of the columns into one column and make it an…
Mpizos Dimitris • 4,819 • 12 • 58 • 100
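
A minimal sketch against the question's own DataFrame: VectorAssembler takes an explicit list of input columns, so combining only a subset is just a matter of which names go into setInputCols.

    import org.apache.spark.ml.feature.VectorAssembler

    val df = sc.parallelize(Seq((1, 2, 10), (3, 4, 11), (5, 6, 12)))
      .toDF("a", "b", "c")
    // Combine only "a" and "b"; "c" stays a plain column.
    val assembled = new VectorAssembler()
      .setInputCols(Array("a", "b"))
      .setOutputCol("features")
      .transform(df)
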
0 votes, 1 answer

PySpark: convert RDD[DenseVector] to DataFrame

I have the following RDD: rdd.take(5) gives me: [DenseVector([9.2463, 1.0, 0.392, 0.3381, 162.6437, 7.9432, 8.3397, 11.7699]), DenseVector([9.2463, 1.0, 0.392, 0.3381, 162.6437, 7.9432, 8.3397, 11.7699]), DenseVector([5.0, 20.0, 0.3444, 0.3295,…
Edamame • 23,718 • 73 • 186 • 320
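
The question is PySpark, but the Scala-equivalent sketch below shows the shape of the conversion: wrap each vector in a Tuple1 so toDF can name the single column.

    import org.apache.spark.ml.linalg.Vectors
    import spark.implicits._

    val rdd = sc.parallelize(Seq(
      Vectors.dense(9.2463, 1.0, 0.392),
      Vectors.dense(5.0, 20.0, 0.3444)))
    val df = rdd.map(Tuple1.apply).toDF("features")
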
0 votes, 3 answers

PySpark: creating a k-means clustering model using spark-ml with a Spark DataFrame

I am using the following code to create a clustering model: import pandas as pd pandas_df = pd.read_pickle('df_features.pickle') spark_df = sqlContext.createDataFrame(pandas_df) from pyspark.ml.linalg import Vectors from pyspark.ml.clustering…
Edamame • 23,718 • 73 • 186 • 320
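
A Scala-equivalent sketch (the PySpark API mirrors it; column names here are hypothetical): ml KMeans consumes a single vector column, so assemble the raw feature columns first.

    import org.apache.spark.ml.clustering.KMeans
    import org.apache.spark.ml.feature.VectorAssembler

    val features = new VectorAssembler()
      .setInputCols(Array("f1", "f2", "f3"))  // hypothetical columns
      .setOutputCol("features")
      .transform(spark_df)                    // the converted pandas df
    val model = new KMeans().setK(3).fit(features)
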
0 votes, 1 answer

How to train a Spark ML gradient-boosted-tree classifier given an RDD

Given the following rdd training_rdd = rdd.select( # Categorical features col('device_os'), # 'ios', 'android' # Numeric features col('30day_click_count'), col('30day_impression_count'), …
samol • 18,950 • 32 • 88 • 127
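
A sketch of the usual pipeline shape for this, using the question's column names plus an assumed "label" column: index the categorical feature, assemble a vector, then fit a GBTClassifier on a DataFrame rather than the raw RDD.

    import org.apache.spark.ml.Pipeline
    import org.apache.spark.ml.classification.GBTClassifier
    import org.apache.spark.ml.feature.{StringIndexer, VectorAssembler}

    val indexer = new StringIndexer()
      .setInputCol("device_os")
      .setOutputCol("device_os_idx")
    val assembler = new VectorAssembler()
      .setInputCols(Array("device_os_idx",
        "30day_click_count", "30day_impression_count"))
      .setOutputCol("features")
    val gbt = new GBTClassifier()
      .setLabelCol("label")                 // assumed label column
      .setFeaturesCol("features")
    val model = new Pipeline()
      .setStages(Array(indexer, assembler, gbt))
      .fit(trainingDF)                      // a DataFrame, not the RDD
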
0 votes, 1 answer

How can I use the pyspark.mllib RDD API metrics to evaluate pyspark.ml (the new DataFrame API)?

The old MLlib API has evaluation metric classes: https://spark.apache.org/docs/latest/mllib-evaluation-metrics.html However, the new DataFrame API does NOT have such a class: https://spark.apache.org/docs/latest/ml-guide.html It has the Evaluator…
Hanan Shteingart • 8,480 • 10 • 53 • 66
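
A sketch of the standard bridge: pull (prediction, label) pairs out of the ml output DataFrame as an RDD and hand them to the mllib metrics classes. This assumes both columns are doubles.

    import org.apache.spark.mllib.evaluation.MulticlassMetrics

    // predictions: the DataFrame returned by model.transform(test)
    val predictionAndLabels = predictions
      .select("prediction", "label")
      .rdd
      .map(r => (r.getDouble(0), r.getDouble(1)))
    val metrics = new MulticlassMetrics(predictionAndLabels)
    println(metrics.confusionMatrix)
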
0 votes, 1 answer

Executor heartbeat timed out in Spark on DataProc

I am trying to fit an ML model in Spark (2.0.0) on a Google DataProc cluster. When fitting the model I receive an Executor heartbeat timed out error. How can I resolve this? Other solutions indicate this is probably due to Out of Memory of (one of)…
Stijn • 459 • 2 • 8 • 18
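
The usual first steps are to give executors more memory and more generous timeouts; a sketch of the relevant settings (the values are illustrative, not a recommendation):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .config("spark.executor.memory", "8g")             // more heap per executor
      .config("spark.executor.heartbeatInterval", "60s")
      .config("spark.network.timeout", "600s")           // must exceed the interval
      .getOrCreate()
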
0 votes, 0 answers

Logistic Regression for multiclass classification using PySpark, and issues

I am trying to use Logistic Regression to classify datasets which have a SparseVector as the feature vector. Case 1: I tried using the ML pipeline as follows: # imported library from ML from pyspark.ml.feature import HashingTF from…
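
In Spark 2.0 the ml LogisticRegression is binomial-only (multinomial support arrived in 2.1), so a common multiclass route is OneVsRest; a Scala-equivalent sketch, with train an assumed DataFrame:

    import org.apache.spark.ml.classification.{LogisticRegression, OneVsRest}

    // SparseVector features are fine; "label"/"features" are the defaults.
    val lr = new LogisticRegression().setMaxIter(10)
    val ovr = new OneVsRest().setClassifier(lr)
    val model = ovr.fit(train)
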
0 votes, 1 answer

Can I measure the parallelization performance of the ML API in Spark?

In general, I want to compare the computing time of the same learning algorithm on a large dataset versus on split datasets in Spark. The other reason is that I want to get the partition model results. However, the result shows that the original…
Martin TT • 301 • 2 • 16
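
A crude wall-clock sketch for such a comparison (estimator and df are hypothetical stand-ins): time the same fit on the full data and on a sampled split.

    // Hypothetical helper: time a block and print the elapsed seconds.
    def time[A](label: String)(body: => A): A = {
      val t0 = System.nanoTime()
      val result = body
      println(s"$label took ${(System.nanoTime() - t0) / 1e9} s")
      result
    }

    val fullModel  = time("full data") { estimator.fit(df) }
    val splitModel = time("half data") { estimator.fit(df.sample(false, 0.5)) }
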
0 votes, 1 answer

IllegalArgumentException: u'requirement failed: Invalid initial capacity' in Spark on Google DataProc

I am currently trying to run an ML decision tree on a large dataset (30 million observations, 13 variables) in Spark 2.0.0 on Google DataProc. When I execute: labelIndexer = StringIndexer(inputCol="Target", outputCol="indexedLabel").fit(data) I receive…
Stijn • 459 • 2 • 8 • 18
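
The "Invalid initial capacity" requirement appears to come from Spark's internal OpenHashSet, which rejects a capacity of zero; one cheap sanity check before fitting, sketched below, is that the input actually has rows (whether that is the trigger here is an assumption).

    import org.apache.spark.ml.feature.StringIndexer

    // Sanity check before fitting: non-empty input.
    require(data.count() > 0, "input DataFrame is empty")
    val labelIndexer = new StringIndexer()
      .setInputCol("Target")
      .setOutputCol("indexedLabel")
      .fit(data)
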
0 votes, 1 answer

I am running GBT in Spark ML for CTR prediction and getting an exception because of the maxBins parameter

Exception details : Exception in thread "main" java.lang.IllegalArgumentException: requirement failed: DecisionTree requires maxBins (= 32) to be at least as large as the number of values in each categorical feature, but categorical feature…
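
The message says some categorical feature has more than 32 distinct values, and maxBins must be at least that large; a sketch of the fix (64 is illustrative, pick a value at least your largest category count):

    import org.apache.spark.ml.classification.GBTClassifier

    val gbt = new GBTClassifier()
      .setMaxBins(64)  // must be >= max #categories of any categorical feature
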
0 votes, 1 answer

Trying to apply GBT on a set of data, getting ClassCastException

I am getting "Exception in thread "main" java.lang.ClassCastException: org.apache.spark.ml.attribute.UnresolvedAttribute$ cannot be cast to org.apache.spark.ml.attribute.NominalAttribute". Source code package com.spark.lograthmicregression; import…
cody123 • 2,040 • 24 • 29
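
This particular cast failure usually means the label column carries no nominal metadata; a common fix, sketched below with illustrative column names, is to run the label through StringIndexer so it gains the NominalAttribute the tree code expects.

    import org.apache.spark.ml.feature.StringIndexer

    val indexed = new StringIndexer()
      .setInputCol("label")
      .setOutputCol("indexedLabel")
      .fit(df)
      .transform(df)
    // Then point the GBT at it via setLabelCol("indexedLabel").
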