Questions tagged [apache-spark-mllib]

MLlib is a low-level, RDD-based machine learning library for Apache Spark.


2241 questions
1 vote · 2 answers

Using SVD in pyspark

I have a huge list of name/surname pairs and I am trying to merge them, for example 'Michael Jordan' with 'Jordan Michael'. I am doing the following procedure using pyspark: calculate TF-IDF -> compute cosine similarity -> convert to sparse…
Mpizos Dimitris · 4,819
1 vote · 1 answer

Spark MLLib LogisticRegression debug model?

I'm working on a LogisticRegression model and trying to debug it. It's a simple setup, but I can't seem to get it to work: I just have a time of day and a state (0 or 1), and I want to predict the state for a given time of day. There are no errors when training…
MrE · 19,584
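A useful debugging step for a model like this is to check whether the problem is even learnable from the single feature. The toy sketch below (plain Python, not MLlib; the data and `predict` helper are invented for illustration) fits a one-feature logistic regression by gradient descent on hour-of-day data where the state flips at noon. In MLlib itself, inspecting `model.weights` and `model.intercept` after training is the analogous sanity check.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Toy data: (hour of day, state); state flips around noon.
data = [(h, 0) for h in range(0, 12)] + [(h, 1) for h in range(12, 24)]

# Single-feature logistic regression by plain stochastic gradient descent.
w, b = 0.0, 0.0
lr = 0.05
for _ in range(2000):
    for hour, y in data:
        x = hour / 24.0          # scale the feature to [0, 1)
        p = sigmoid(w * x + b)
        w -= lr * (p - y) * x    # gradient of the log-loss w.r.t. w
        b -= lr * (p - y)        # gradient w.r.t. b

predict = lambda hour: int(sigmoid(w * (hour / 24.0) + b) >= 0.5)
print(predict(3), predict(20))  # 0 1
```

If a toy fit like this separates the data but the MLlib model does not, the usual suspects are unscaled features or mislabeled `LabeledPoint` construction.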
1 vote · 0 answers

Spark ALS-WR giving the same recommended items for all users

We are trying to build a recommendation system for a supermarket with diverse item types (ranging from fast-moving grocery items to slow-moving electronics). Some items are purchased frequently and in high volume, and some items are purchased only…
1 vote · 1 answer

Spark MLlib RowMatrix from SparseVector

I am trying to create a RowMatrix from an RDD of SparseVectors but am getting the following error: :37: error: type mismatch; found : dataRows.type (with underlying type…
Ryan · 72
1 vote · 1 answer

Spark 1.6.0 DenseMatrix update values

There was an update method in Spark 1.3.1 (https://spark.apache.org/docs/1.3.1/api/java/org/apache/spark/mllib/linalg/DenseMatrix.html), but in Spark 1.6.0 there is no update method…
purplebee · 61
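MLlib's DenseMatrix stores its entries in a flat, column-major `values` array, so even where the `update` method is no longer public, element (i, j) can still be addressed at offset `i + j * numRows` in that array. The stand-in class below is a plain-Python sketch of that layout (not the real Spark class), just to show the indexing the workaround relies on:

```python
class DenseMatrix:
    """Minimal stand-in for MLlib's DenseMatrix: a flat,
    column-major value array, as MLlib stores it."""
    def __init__(self, num_rows, num_cols, values):
        assert len(values) == num_rows * num_cols
        self.num_rows, self.num_cols = num_rows, num_cols
        self.values = list(values)

    def index(self, i, j):
        # Column-major layout: element (i, j) lives at i + j * numRows.
        return i + j * self.num_rows

    def get(self, i, j):
        return self.values[self.index(i, j)]

    def update(self, i, j, v):
        # The workaround: mutate the backing array directly.
        self.values[self.index(i, j)] = v

m = DenseMatrix(2, 2, [1.0, 3.0, 2.0, 4.0])  # [[1, 2], [3, 4]]
m.update(0, 1, 9.0)
print(m.get(0, 1))  # 9.0
```

In Spark 1.6 the same idea applies: write into the matrix's backing `values` array at that offset, or build a new DenseMatrix from a modified array.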
1 vote · 0 answers

Spark: Frequent Pattern Mining: issues in saving the results

I am using Spark's FP-growth algorithm. I was getting OOM errors when doing a collect, so I changed the code to save the results to a text file on HDFS rather than collecting them on the driver node. Here is the relevant code: //…
user3803714 · 5,269
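The change described here, writing results out instead of materializing them on the driver, can be illustrated without Spark. In the sketch below (plain Python; the `freq_itemsets` generator is a made-up stand-in for the model's frequent-itemset results, and the sample data is invented), results are streamed to a file one at a time rather than accumulated in memory, which is the driver-side analogue of calling `saveAsTextFile` on the RDD instead of `collect()`:

```python
import os
import tempfile

def freq_itemsets():
    # Stand-in for the model's frequent itemsets: yield results lazily
    # instead of materializing them all at once (the analogue of collect()).
    for items, freq in [(["bread"], 5), (["bread", "milk"], 3), (["milk"], 4)]:
        yield items, freq

path = os.path.join(tempfile.mkdtemp(), "fp_growth_output.txt")
with open(path, "w") as out:
    for items, freq in freq_itemsets():   # never holds everything in memory
        out.write("{}\t{}\n".format(",".join(items), freq))

print(sum(1 for _ in open(path)))  # 3
```

In Spark proper, saving the itemset RDD writes directly from the executors to HDFS, so the driver never needs to hold the full result set at all.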
1 vote · 1 answer

sbt: using local jar without breaking the dependencies

I am building an application that uses Spark and Spark MLlib; the build.sbt states the dependencies as follows: libraryDependencies ++= Seq( "org.apache.spark" %% "spark-core" % "1.6.0" withSources() withJavadoc(), …
Xiangyu · 824
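One common approach to this is to leave the managed dependencies untouched and add the local jar as an unmanaged jar, so it never participates in dependency resolution. A build.sbt fragment sketching the idea (the `lib/my-patched-mllib.jar` path is a made-up example, and versions are taken from the question):

```scala
// build.sbt: keep the managed dependencies exactly as they were...
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core"  % "1.6.0" withSources() withJavadoc(),
  "org.apache.spark" %% "spark-mllib" % "1.6.0" withSources() withJavadoc()
)

// ...and add the local jar (assumed to live in lib/) as an unmanaged jar,
// so it is on the classpath without breaking transitive resolution.
unmanagedJars in Compile += file("lib/my-patched-mllib.jar")
```

Jars dropped into the project's `lib/` directory are also picked up by sbt automatically, which is the zero-configuration version of the same idea.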
1 vote · 3 answers

Convert a JavaRDD<String> to JavaRDD<Vector>

I'm trying to load a CSV file as a JavaRDD<String> and then want to get the data into a JavaRDD<Vector>: import org.apache.spark.SparkConf; import org.apache.spark.api.java.JavaRDD; import org.apache.spark.api.java.JavaSparkContext; import…
Anshul Kalra · 198
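The conversion itself is a single map over the lines. The plain-Python sketch below shows the parsing step outside Spark (the `parse_line` helper and sample lines are invented for illustration); in the Java version the same function body would go inside `rdd.map(...)` and return `Vectors.dense(...)`:

```python
def parse_line(line):
    # Mirrors the map step: one CSV line -> one dense feature vector.
    return [float(x) for x in line.strip().split(",")]

lines = ["1.0,2.0,3.0", "4.0,5.0,6.0"]     # stand-in for the loaded file
vectors = list(map(parse_line, lines))     # stand-in for JavaRDD.map
print(vectors)  # [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
```

The usual pitfalls are a header row (filter it out before mapping) and non-numeric columns, which make the `float` conversion throw.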
1 vote · 0 answers

Size of a random forest model in MLlib

I have to compute and keep in memory several (e.g. 20 or more) random forest models with Apache Spark. I have only 8 GB available on the driver of the YARN cluster I use to launch the job, and I am facing OutOfMemory errors because the models do…
Pop · 12,135
1 vote · 1 answer

MLlib model (RandomForestModel) saves model with numerous small parquet files

I'm trying to train an MLlib RandomForestRegression model using the RandomForest.trainRegressor API. After training, when I try to save the model, the resulting model folder is only 6.5 MB on disk, but there are 1,120 small parquet files in the…
x89a10 · 681
1 vote · 1 answer

Spark Feature Vector Transformation on Pre-Sorted Input

I have some data in a tab-delimited file on HDFS that looks like this:

label | user_id | feature
------------------------------
pos   | 111     | www.abc.com
pos   | 111     | www.xyz.com
pos   | 111     | Firefox
pos   | 222     | www.example.com
…
Larsenal · 49,878
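Because the input is already sorted by user_id, one streaming pass is enough to assemble a feature vector per user, with no shuffle-style regrouping. A plain-Python sketch using the rows from the question (the feature-index mapping and variable names are invented for illustration; in Spark the index could come from `zipWithIndex` over the distinct features):

```python
from itertools import groupby

# Rows pre-sorted by user_id, as in the question: (label, user_id, feature).
rows = [
    ("pos", "111", "www.abc.com"),
    ("pos", "111", "www.xyz.com"),
    ("pos", "111", "Firefox"),
    ("pos", "222", "www.example.com"),
]

# Global feature -> column-index mapping.
features = sorted({f for _, _, f in rows})
index = {f: i for i, f in enumerate(features)}

# One pass: groupby works precisely because the rows are pre-sorted.
vectors_by_user = {}
for (label, user), group in groupby(rows, key=lambda r: (r[0], r[1])):
    vec = [0] * len(features)
    for _, _, feat in group:
        vec[index[feat]] = 1   # binary one-hot feature vector
    vectors_by_user[user] = (label, vec)

print(vectors_by_user["111"])  # ('pos', [1, 1, 0, 1])
```

With many distinct features the dense list should become a sparse vector, but the grouping logic is unchanged.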
1 vote · 2 answers

Is my application running efficiently?

The question is generic and can be extended to other frameworks or contexts beyond Spark and machine learning algorithms. Regardless of the details, from a high-level point of view, the code is applied to a large dataset of labeled text documents. It…
Rami · 8,044
1 vote · 2 answers

Document classification in spark mllib

I want to classify documents as belonging to sports, entertainment, or politics. I have created a bag of words which outputs something like: (1, 'saurashtra') (1, 'saumyajit') (1, 'satyendra') I want to implement the naive Bayes algorithm for…
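The mechanics of multinomial naive Bayes on bag-of-words counts can be shown in a few lines of plain Python. This is a toy sketch, not MLlib's `NaiveBayes` (which takes `LabeledPoint`s of term-count vectors); the corpus and the `predict` helper are invented for illustration:

```python
import math
from collections import Counter, defaultdict

# Tiny labeled corpus: (class, tokens). A real bag of words would be
# far larger; this only shows the mechanics.
docs = [
    ("sports", ["match", "score", "team"]),
    ("sports", ["team", "goal"]),
    ("politics", ["election", "vote"]),
    ("politics", ["vote", "parliament"]),
]

class_docs = Counter(c for c, _ in docs)          # class priors (counts)
word_counts = defaultdict(Counter)                # per-class term counts
for c, words in docs:
    word_counts[c].update(words)
vocab = {w for _, ws in docs for w in ws}

def predict(words):
    # Multinomial naive Bayes with Laplace (add-one) smoothing,
    # computed in log space to avoid underflow.
    best, best_lp = None, float("-inf")
    for c in class_docs:
        lp = math.log(class_docs[c] / len(docs))
        total = sum(word_counts[c].values())
        for w in words:
            lp += math.log((word_counts[c][w] + 1) / (total + len(vocab)))
        if lp > best_lp:
            best, best_lp = c, lp
    return best

print(predict(["team", "score"]))     # sports
print(predict(["vote", "election"]))  # politics
```

In MLlib the same model is trained by turning each document's word counts into a vector (e.g. via `HashingTF`) and calling `NaiveBayes.train` on the resulting labeled points.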
1 vote · 0 answers

Spark MLlib and Rest

I was working on a proof of concept for exposing Spark MLlib training and prediction serving to multiple tenants through some form of REST interface. I got a POC up and running, but it seems a bit wasteful as it has to create numerous Spark…
Feras · 2,114
1 vote · 2 answers

Spark job using too many resources

I am launching a cross-validation study on 50 containers of a YARN cluster. The data are about 600,000 lines. The job works well most of the time but uses a lot of RAM and CPU resources on the driver server of the cluster (the machine where the job…
Pop · 12,135