Questions tagged [apache-spark-mllib]

MLlib is a low-level, RDD-based machine learning library for Apache Spark.


2241 questions
1 vote · 2 answers

Using SVD in pyspark

I have a huge list of name/surname pairs and I am trying to merge them, for example 'Michael Jordan' with 'Jordan Michael'. I am doing the following procedure using pyspark: calculate TF-IDF -> compute cosine similarity -> convert to sparse…
Mpizos Dimitris · 4,819
1 vote · 1 answer

Spark MLLib LogisticRegression debug model?

I'm working on a LogisticRegression model and trying to debug it. It's a simple setup, but I can't seem to get it to work: I just have a time of day and a state (0 or 1), and I want to predict the state for a given time of day. There are no errors when training…
MrE · 19,584
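A useful debugging step for a model like this is to check whether the problem is even learnable from the single feature. The toy sketch below (plain Python, not MLlib; the data and `predict` helper are invented for illustration) fits a one-feature logistic regression by gradient descent on hour-of-day data where the state flips at noon. In MLlib itself, inspecting `model.weights` and `model.intercept` after training is the analogous sanity check.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Toy data: (hour of day, state); state flips around noon.
data = [(h, 0) for h in range(0, 12)] + [(h, 1) for h in range(12, 24)]

# Single-feature logistic regression by plain stochastic gradient descent.
w, b = 0.0, 0.0
lr = 0.05
for _ in range(2000):
    for hour, y in data:
        x = hour / 24.0          # scale the feature to [0, 1)
        p = sigmoid(w * x + b)
        w -= lr * (p - y) * x    # gradient of the log-loss w.r.t. w
        b -= lr * (p - y)        # gradient w.r.t. b

predict = lambda hour: int(sigmoid(w * (hour / 24.0) + b) >= 0.5)
print(predict(3), predict(20))  # 0 1
```

If a toy fit like this separates the data but the MLlib model does not, the usual suspects are unscaled features or mislabeled `LabeledPoint` construction.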
1 vote · 0 answers

Spark ALS-WR giving the same recommended items for all users

We are trying to build a recommendation system for a supermarket with diverse item types (ranging from fast-moving grocery items to slow-moving electronics). Some items are purchased frequently and in high volume, and some items are purchased only…
1 vote · 1 answer

Spark MLlib RowMatrix from SparseVector

I am trying to create a RowMatrix from an RDD of SparseVectors but am getting the following error: :37: error: type mismatch; found : dataRows.type (with underlying type…
Ryan · 72
1 vote · 1 answer

Spark 1.6.0 DenseMatrix update values

There was an update method in Spark 1.3.1 (https://spark.apache.org/docs/1.3.1/api/java/org/apache/spark/mllib/linalg/DenseMatrix.html), but in Spark 1.6.0 there is no update method…
purplebee · 61
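MLlib's DenseMatrix stores its entries in a flat, column-major `values` array, so even where the `update` method is no longer public, element (i, j) can still be addressed at offset `i + j * numRows` in that array. The stand-in class below is a plain-Python sketch of that layout (not the real Spark class), just to show the indexing the workaround relies on:

```python
class DenseMatrix:
    """Minimal stand-in for MLlib's DenseMatrix: a flat,
    column-major value array, as MLlib stores it."""
    def __init__(self, num_rows, num_cols, values):
        assert len(values) == num_rows * num_cols
        self.num_rows, self.num_cols = num_rows, num_cols
        self.values = list(values)

    def index(self, i, j):
        # Column-major layout: element (i, j) lives at i + j * numRows.
        return i + j * self.num_rows

    def get(self, i, j):
        return self.values[self.index(i, j)]

    def update(self, i, j, v):
        # The workaround: mutate the backing array directly.
        self.values[self.index(i, j)] = v

m = DenseMatrix(2, 2, [1.0, 3.0, 2.0, 4.0])  # [[1, 2], [3, 4]]
m.update(0, 1, 9.0)
print(m.get(0, 1))  # 9.0
```

In Spark 1.6 the same idea applies: write into the matrix's backing `values` array at that offset, or build a new DenseMatrix from a modified array.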
1 vote · 0 answers

Spark: Frequent Pattern Mining: issues in saving the results

I am using Spark's FP-growth algorithm. I was getting OOM errors when doing a collect, so I changed the code to save the results to a text file on HDFS rather than collecting them on the driver node. Here is the relevant code: //…
user3803714 · 5,269
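The change described here, writing results out instead of materializing them on the driver, can be illustrated without Spark. In the sketch below (plain Python; the `freq_itemsets` generator is a made-up stand-in for the model's frequent-itemset results, and the sample data is invented), results are streamed to a file one at a time rather than accumulated in memory, which is the driver-side analogue of calling `saveAsTextFile` on the RDD instead of `collect()`:

```python
import os
import tempfile

def freq_itemsets():
    # Stand-in for the model's frequent itemsets: yield results lazily
    # instead of materializing them all at once (the analogue of collect()).
    for items, freq in [(["bread"], 5), (["bread", "milk"], 3), (["milk"], 4)]:
        yield items, freq

path = os.path.join(tempfile.mkdtemp(), "fp_growth_output.txt")
with open(path, "w") as out:
    for items, freq in freq_itemsets():   # never holds everything in memory
        out.write("{}\t{}\n".format(",".join(items), freq))

print(sum(1 for _ in open(path)))  # 3
```

In Spark proper, saving the itemset RDD writes directly from the executors to HDFS, so the driver never needs to hold the full result set at all.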
1 vote · 1 answer

sbt: using local jar without breaking the dependencies

I am building an application that uses Spark and Spark MLlib; the build.sbt states the dependencies as follows: libraryDependencies ++= Seq( "org.apache.spark" %% "spark-core" % "1.6.0" withSources() withJavadoc(), …
Xiangyu · 824
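One common approach to this is to leave the managed dependencies untouched and add the local jar as an unmanaged jar, so it never participates in dependency resolution. A build.sbt fragment sketching the idea (the `lib/my-patched-mllib.jar` path is a made-up example, and versions are taken from the question):

```scala
// build.sbt: keep the managed dependencies exactly as they were...
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core"  % "1.6.0" withSources() withJavadoc(),
  "org.apache.spark" %% "spark-mllib" % "1.6.0" withSources() withJavadoc()
)

// ...and add the local jar (assumed to live in lib/) as an unmanaged jar,
// so it is on the classpath without breaking transitive resolution.
unmanagedJars in Compile += file("lib/my-patched-mllib.jar")
```

Jars dropped into the project's `lib/` directory are also picked up by sbt automatically, which is the zero-configuration version of the same idea.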
1 vote · 3 answers

Convert a JavaRDD<String> to JavaRDD<Vector>

I'm trying to load a CSV file as a JavaRDD<String> and then want to get the data into a JavaRDD<Vector>: import org.apache.spark.SparkConf; import org.apache.spark.api.java.JavaRDD; import org.apache.spark.api.java.JavaSparkContext; import…
Anshul Kalra · 198
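The conversion itself is a single map over the lines. The plain-Python sketch below shows the parsing step outside Spark (the `parse_line` helper and sample lines are invented for illustration); in the Java version the same function body would go inside `rdd.map(...)` and return `Vectors.dense(...)`:

```python
def parse_line(line):
    # Mirrors the map step: one CSV line -> one dense feature vector.
    return [float(x) for x in line.strip().split(",")]

lines = ["1.0,2.0,3.0", "4.0,5.0,6.0"]     # stand-in for the loaded file
vectors = list(map(parse_line, lines))     # stand-in for JavaRDD.map
print(vectors)  # [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
```

The usual pitfalls are a header row (filter it out before mapping) and non-numeric columns, which make the `float` conversion throw.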
1 vote · 0 answers

Size of a random forest model in MLlib

I have to compute and keep in memory several (e.g. 20 or more) random forest models with Apache Spark. I have only 8 GB available on the driver of the YARN cluster I use to launch the job, and I am facing OutOfMemory errors because the models do…
Pop · 12,135
1 vote · 1 answer

MLlib model (RandomForestModel) saves model with numerous small parquet files

I'm trying to train an MLlib RandomForestRegression model using the RandomForest.trainRegressor API. After training, when I try to save the model, the resulting model folder is only 6.5 MB on disk, but there are 1,120 small parquet files in the…
x89a10 · 681
1 vote · 1 answer

Spark Feature Vector Transformation on Pre-Sorted Input

I have some data in a tab-delimited file on HDFS that looks like this:

label | user_id | feature
------------------------------
pos   | 111     | www.abc.com
pos   | 111     | www.xyz.com
pos   | 111     | Firefox
pos   | 222     | www.example.com
…
Larsenal · 49,878
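Because the input is already sorted by user_id, one streaming pass is enough to assemble a feature vector per user, with no shuffle-style regrouping. A plain-Python sketch using the rows from the question (the feature-index mapping and variable names are invented for illustration; in Spark the index could come from `zipWithIndex` over the distinct features):

```python
from itertools import groupby

# Rows pre-sorted by user_id, as in the question: (label, user_id, feature).
rows = [
    ("pos", "111", "www.abc.com"),
    ("pos", "111", "www.xyz.com"),
    ("pos", "111", "Firefox"),
    ("pos", "222", "www.example.com"),
]

# Global feature -> column-index mapping.
features = sorted({f for _, _, f in rows})
index = {f: i for i, f in enumerate(features)}

# One pass: groupby works precisely because the rows are pre-sorted.
vectors_by_user = {}
for (label, user), group in groupby(rows, key=lambda r: (r[0], r[1])):
    vec = [0] * len(features)
    for _, _, feat in group:
        vec[index[feat]] = 1   # binary one-hot feature vector
    vectors_by_user[user] = (label, vec)

print(vectors_by_user["111"])  # ('pos', [1, 1, 0, 1])
```

With many distinct features the dense list should become a sparse vector, but the grouping logic is unchanged.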
1 vote · 2 answers

Is my application running efficiently?

The question is generic and can be extended to other frameworks or contexts beyond Spark and machine learning algorithms. Regardless of the details, from a high-level point of view, the code is applied to a large dataset of labeled text documents. It…
Rami · 8,044
1 vote · 2 answers

Document classification in spark mllib

I want to classify documents as belonging to sports, entertainment, or politics. I have created a bag of words which outputs something like: (1, 'saurashtra') (1, 'saumyajit') (1, 'satyendra') I want to implement the naive Bayes algorithm for…
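The mechanics of multinomial naive Bayes on bag-of-words counts can be shown in a few lines of plain Python. This is a toy sketch, not MLlib's `NaiveBayes` (which takes `LabeledPoint`s of term-count vectors); the corpus and the `predict` helper are invented for illustration:

```python
import math
from collections import Counter, defaultdict

# Tiny labeled corpus: (class, tokens). A real bag of words would be
# far larger; this only shows the mechanics.
docs = [
    ("sports", ["match", "score", "team"]),
    ("sports", ["team", "goal"]),
    ("politics", ["election", "vote"]),
    ("politics", ["vote", "parliament"]),
]

class_docs = Counter(c for c, _ in docs)          # class priors (counts)
word_counts = defaultdict(Counter)                # per-class term counts
for c, words in docs:
    word_counts[c].update(words)
vocab = {w for _, ws in docs for w in ws}

def predict(words):
    # Multinomial naive Bayes with Laplace (add-one) smoothing,
    # computed in log space to avoid underflow.
    best, best_lp = None, float("-inf")
    for c in class_docs:
        lp = math.log(class_docs[c] / len(docs))
        total = sum(word_counts[c].values())
        for w in words:
            lp += math.log((word_counts[c][w] + 1) / (total + len(vocab)))
        if lp > best_lp:
            best, best_lp = c, lp
    return best

print(predict(["team", "score"]))     # sports
print(predict(["vote", "election"]))  # politics
```

In MLlib the same model is trained by turning each document's word counts into a vector (e.g. via `HashingTF`) and calling `NaiveBayes.train` on the resulting labeled points.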
1 vote · 0 answers

Spark MLlib and Rest

I was working on a proof of concept for exposing Spark MLlib training and prediction serving to multiple tenants through some form of REST interface. I got a POC up and running, but it seems a bit wasteful as it has to create numerous Spark…
Feras · 2,114
1 vote · 2 answers

Spark job using too many resources

I am launching a cross-validation study on 50 containers of a YARN cluster. The data are about 600,000 lines. The job works well most of the time but uses a lot of RAM and CPU resources on the driver server of the cluster (the machine where the job…
Pop · 12,135