Questions tagged [apache-spark-mllib]

MLlib is a machine learning library for Apache Spark

MLlib is a low-level, RDD-based machine learning library for Apache Spark.

2241 questions
13
votes
1 answer

How to add an incremental column ID for a table in Spark SQL

I'm working on a Spark MLlib algorithm. The dataset I have is in this form: Company":"XXXX","CurrentTitle":"XYZ","Edu_Title":"ABC","Exp_mnth":. (there are more values similar to these). I'm trying to map the raw String values to numeric values. So, I…
KM-Yash
  • 133
  • 1
  • 1
  • 6
13
votes
3 answers

How do I convert an RDD with a SparseVector Column to a DataFrame with a column as Vector

I have an RDD with a tuple of values (String, SparseVector) and I want to create a DataFrame from it, to get a (label:string, features:vector) DataFrame, which is the schema required by most of the ml algorithm libraries. I know it can be…
13
votes
4 answers

Error ExecutorLostFailure when running a task in Spark

Hi, I am a beginner in Spark. I am trying to run a job on Spark 1.4.1 with 8 slave nodes with 11.7 GB memory and 3.2 GB disk each. When I try to run it on this folder it throws ExecutorLostFailure every time. I am running the Spark task from…
User17
  • 131
  • 1
  • 1
  • 4
13
votes
4 answers

How to use mllib.recommendation if the user ids are string instead of contiguous integers?

I want to use Spark's mllib.recommendation library to build a prototype recommender system. However, the format of the user data I have is something of the following format: AB123XY45678 CD234WZ12345 EF345OOO1234 GH456XY98765 .... If I want to use…
shihpeng
  • 5,283
  • 6
  • 37
  • 63
12
votes
3 answers

How to use XGBoost in a PySpark Pipeline

I want to update my pyspark code. In pyspark, the base model must be put in a Pipeline; the official Pipeline demo uses LogisticRegression as the base model. However, it does not seem possible to use an XGBoost model in the Pipeline API. How can…
12
votes
3 answers

How to overwrite Spark ML model in PySpark?

from pyspark.ml.regression import RandomForestRegressionModel rf = RandomForestRegressor(labelCol="label",featuresCol="features", numTrees=5, maxDepth=10, seed=42) rf_model = rf.fit(train_df) rf_model_path = "./hdfsData/" +…
12
votes
1 answer

Spark K-fold Cross Validation

I’m having some trouble understanding Spark’s cross validation. Every example I have seen uses it for parameter tuning, but I assumed it could also do plain k-fold cross validation. What I want to do is to perform k-fold cross…
12
votes
4 answers

DBSCAN on spark : which implementation

I would like to do some DBSCAN on Spark. I have currently found 2 implementations: https://github.com/irvingc/dbscan-on-spark https://github.com/alitouka/spark_dbscan I have tested the first one with the sbt configuration given in its github but:…
Benjamin
  • 3,350
  • 4
  • 24
  • 49
12
votes
1 answer

Proper save/load of MatrixFactorizationModel

I have a MatrixFactorizationModel object. If I recommend products to a single user right after constructing the model through ALS.train(...), it takes 300 ms (for my data and hardware). But if I save the model to disk and load it back, then…
Osmin
  • 426
  • 3
  • 12
12
votes
3 answers

Using DataFrame with MLlib

Let's say I have a DataFrame (that I read in from a csv on HDFS) and I want to train some algorithms on it via MLlib. How do I convert the rows into LabeledPoints or otherwise utilize MLlib on this dataset?
kevinykuo
  • 4,600
  • 5
  • 23
  • 31
11
votes
1 answer

Calculate Cosine Similarity Spark Dataframe

I am using Spark with Scala to calculate cosine similarity between the Dataframe rows. The Dataframe format is below: root |-- SKU: double (nullable = true) |-- Features: vector (nullable = true). Sample of the dataframe below…
11
votes
1 answer

ALS model - predicted full_u * v^t * v ratings are very high

I'm predicting ratings in between processes that batch train the model. I'm using the approach outlined here: ALS model - how to generate full_u * v^t * v? ! rm -rf ml-1m.zip ml-1m ! wget --quiet…
Chris Snow
  • 23,813
  • 35
  • 144
  • 309
11
votes
2 answers

How to convert RDD of dense vector into DataFrame in pyspark?

I have a DenseVector RDD like this >>> frequencyDenseVectors.collect() [DenseVector([1.0, 0.0, 1.0, 1.0, 0.0, 0.0, 1.0, 0.0, 1.0, 1.0, 1.0, 0.0, 1.0]), DenseVector([1.0, 1.0, 1.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]), DenseVector([1.0,…
11
votes
2 answers

PySpark: How to evaluate the AUC of an ML recommendation algorithm?

I have a Spark Dataframe as below: predictions.show(5) +------+----+------+-----------+ | user|item|rating| prediction| +------+----+------+-----------+ |379433| 31| 1| 0.08203495| | 1834| 31| 1| 0.4854447| |422635| 31| …
Baktaawar
  • 7,086
  • 24
  • 81
  • 149
11
votes
2 answers

Is Spark's KMeans unable to handle big data?

KMeans has several parameters for its training, with the initialization mode defaulting to kmeans||. The problem is that it marches quickly (less than 10 min) through the first 13 stages, but then hangs completely without yielding an error! Minimal Example…
gsamaras
  • 71,951
  • 46
  • 188
  • 305