Questions tagged [apache-spark-mllib]

MLlib is a machine learning library for Apache Spark.

MLlib is a low-level, RDD-based machine learning library for Apache Spark.

2241 questions
1 vote
1 answer

How to convert a PythonRDD with sparse data into a dense PythonRDD

I want to use StandardScaler to scale the data. I've loaded the data into a PythonRDD. It seems the data is sparse. To apply StandardScaler, we should first convert it into dense types. trainData = MLUtils.loadLibSVMFile(sc, trainDataPath) valData…
mining
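
A minimal Scala sketch of the usual conversion (the path is hypothetical; withMean = true is what actually requires dense vectors):

```scala
import org.apache.spark.mllib.feature.StandardScaler
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.util.MLUtils

// loadLibSVMFile yields SparseVector features; rebuild each as a dense vector
val trainData = MLUtils.loadLibSVMFile(sc, "data/train.libsvm")  // hypothetical path
val dense = trainData.map(lp => LabeledPoint(lp.label, Vectors.dense(lp.features.toArray)))

// withMean = true only works on dense vectors, which is why the conversion is needed
val scaler = new StandardScaler(withMean = true, withStd = true).fit(dense.map(_.features))
val scaled = dense.map(lp => LabeledPoint(lp.label, scaler.transform(lp.features)))
```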
1 vote
1 answer

Best way to build a LabeledPoint of features for Apache Spark MLlib in Java

I am preparing data that contains IDs (labels) and keywords (features) to pass to MLlib algorithms, in Java. My keywords are comma-separated strings. My goal is to use multiclass classification algorithms to predict the ID. The question…
Sparkan
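
The question is Java, but here is a hedged Scala sketch of one common approach: hash the comma-separated keywords into a fixed-size feature vector with MLlib's HashingTF. The file name and line format are assumptions:

```scala
import org.apache.spark.mllib.feature.HashingTF
import org.apache.spark.mllib.regression.LabeledPoint

// assumed input format: "id,keyword1,keyword2,..." per line (hypothetical file)
val tf = new HashingTF(numFeatures = 1 << 18)
val points = sc.textFile("keywords.csv").map { line =>
  val fields = line.split(",")
  // the id becomes the label; the keywords are hashed into a sparse feature vector
  LabeledPoint(fields.head.toDouble, tf.transform(fields.tail.toSeq))
}
```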
1 vote
2 answers

Finding bigrams in Spark with Java 8

I have tokenized the sentences into a word RDD, so now I need bigrams, e.g. This is my test => (This is), (is my), (my test). I have searched around and found the .sliding operator for this purpose, but I'm not getting this option in my Eclipse (may it is…
insomniac
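
The sliding operator lives in MLlib's RDDFunctions and needs an explicit import, which is the usual reason IDEs don't offer it. A minimal Scala sketch (the source file is hypothetical):

```scala
import org.apache.spark.mllib.rdd.RDDFunctions._

// words: an RDD of tokens in document order
val words = sc.textFile("doc.txt").flatMap(_.split(" "))
// sliding(2) yields overlapping windows: (This, is), (is, my), (my, test)
val bigrams = words.sliding(2).map { case Array(a, b) => (a, b) }
```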
1 vote
1 answer

"The argument types of an anonymous function must be fully known (SLS 8.5)" when word2vec is applied to a DataFrame

I apply Spark's word2vec by using a dataframe. Here is my code: val df2 = df.groupBy("LABEL").agg(collect_list("TERM").alias("TERM")) val word2Vec = new Word2Vec() .setInputCol("TERM") .setOutputCol("result") …
mlee_jordan
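
A minimal sketch of the DataFrame-based Word2Vec from spark.ml, assuming df has string columns LABEL and TERM as in the question; the input column must be an array of strings, which is what collect_list produces:

```scala
import org.apache.spark.ml.feature.Word2Vec
import org.apache.spark.sql.functions.collect_list

// group terms into an array column, one token sequence per LABEL
val df2 = df.groupBy("LABEL").agg(collect_list("TERM").alias("TERM"))

val word2Vec = new Word2Vec()
  .setInputCol("TERM")     // must be an array-of-strings column
  .setOutputCol("result")
  .setVectorSize(100)
val result = word2Vec.fit(df2).transform(df2)
```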
1 vote
3 answers

Training Spark's word2vec with an RDD[String]

I'm new to Spark and Scala so I might have misunderstood some basic things here. I'm trying to train Spark's word2vec model on my own data. According to their documentation, one way to do this is val input = sc.textFile("text8").map(line =>…
burk
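
MLlib's Word2Vec.fit expects an RDD[Seq[String]] (one token sequence per line), not an RDD[String]. A minimal sketch following the documented text8 example (the query word is hypothetical):

```scala
import org.apache.spark.mllib.feature.Word2Vec

// split each line into tokens so the RDD element type is Seq[String], not String
val input = sc.textFile("text8").map(line => line.split(" ").toSeq)
val model = new Word2Vec().fit(input)
val synonyms = model.findSynonyms("day", 5)  // hypothetical query word
```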
1 vote
1 answer

/usr/bin/time CPU utilization versus top while using Spark

I ran an SVM algorithm using the MLlib library in Spark on data of size 8 GB with 7 million rows. I am running Spark in standalone mode on a single node. I used /usr/bin/time -v to capture data about the job. I got the peak memory utilization, and % CPU…
1 vote
0 answers

True negatives are 0% whereas true positives are 100% correctly classified

I used Naive Bayes from Spark's MLlib to train a model and test it on the data (in the form of an RDD). The results were confusing. The data and results are as follows: the problem is a binary classification one. The outcome should be either a label…
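
One way to see the imbalance is a confusion matrix; a hedged sketch, assuming predictionAndLabels is an RDD[(Double, Double)] of (predicted, actual) pairs built from the test set:

```scala
import org.apache.spark.mllib.evaluation.MulticlassMetrics

// predictionAndLabels: RDD[(Double, Double)], assumed built by running
// model.predict over the test features and zipping with the true labels
val metrics = new MulticlassMetrics(predictionAndLabels)
// a 0% true-negative rate shows up as an all-zero column for that class
println(metrics.confusionMatrix)
```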
1 vote
0 answers

Streaming Clustering with Unknown Number of Clusters

I need to classify a number of data points that will arrive over time. Streaming K-Means would be fine if I only knew how many different classes (clusters) I might find in my data points. Is there any way to use Spark MLlib 'out of the box' to run a…
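
For reference, MLlib's StreamingKMeans still requires fixing k up front, which is exactly the obstacle here; a minimal sketch (the cluster count, dimension, and input stream are assumptions):

```scala
import org.apache.spark.mllib.clustering.StreamingKMeans

val dim = 3  // hypothetical feature dimension
val model = new StreamingKMeans()
  .setK(5)                      // k must be chosen in advance, even when streaming
  .setDecayFactor(0.5)          // how quickly older points are forgotten
  .setRandomCenters(dim, 0.0)
// trainingStream: DStream[Vector], assumed built from a StreamingContext
model.trainOn(trainingStream)
```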
1 vote
0 answers

Elegant way to unfold RDD[LabeledPoint] into a DataFrame, something like RowMatrix

I have libsvm data with (label, sparse vector) pairs which I can load into an RDD[LabeledPoint]. I am wondering if there is an elegant way to convert it to a DataFrame that holds the unfolded features, something like RowMatrix, in Scala.
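
One possible unfolding, sketched under the assumption that data is an RDD[LabeledPoint] and a sqlContext is in scope: emit one Row per point with the label followed by one column per feature.

```scala
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{DoubleType, StructField, StructType}

// data: RDD[LabeledPoint], assumed loaded via MLUtils.loadLibSVMFile
val numFeatures = data.first().features.size
val rows = data.map(lp => Row.fromSeq(lp.label +: lp.features.toArray.toSeq))
val schema = StructType(
  StructField("label", DoubleType) +:
    (0 until numFeatures).map(i => StructField(s"f$i", DoubleType)))
val df = sqlContext.createDataFrame(rows, schema)  // one column per unfolded feature
```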
1 vote
0 answers

Spark mllib Collaborative Filtering, ValueError: RDD is empty

I'm new to Spark and am running the implicit collaborative filtering from MLlib here. When I run the following code on my data, I'm getting the following error: ValueError: RDD is empty. Here is my data: 101,1000010,1 101,1000011,1 …
jKraut
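
The question is pyspark, but the Scala flow is analogous; a hedged sketch with a guard, since this error usually means nothing survived parsing (the path and delimiter are assumptions):

```scala
import org.apache.spark.mllib.recommendation.{ALS, Rating}

// parse "user,product,rating" lines; a wrong delimiter or path leaves this RDD empty
val ratings = sc.textFile("ratings.csv").map(_.split(",")).collect {
  case Array(user, product, rating) => Rating(user.toInt, product.toInt, rating.toDouble)
}
require(!ratings.isEmpty(), "no ratings parsed - check the path and delimiter")
val model = ALS.trainImplicit(ratings, rank = 10, iterations = 10)
```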
1 vote
1 answer

Spark Latent Dirichlet Allocation model topic matrix is too small

First, just in case, I will explain how I represented the documents that I want to run the LDA model on. First, I do some preprocessing to get the most important terms per person for all their documents, then I get the union of all the most…
Jake Fund
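
topicsMatrix is vocabSize × k, so a "too small" matrix usually traces back to a shrunken vocabulary after preprocessing. A minimal sketch, assuming corpus is an RDD[(Long, Vector)] of (docId, termCounts):

```scala
import org.apache.spark.mllib.clustering.LDA

// corpus: RDD[(Long, Vector)], assumed prepared upstream
val ldaModel = new LDA().setK(10).run(corpus)
// the row count of topicsMatrix is fixed by the length of the term-count vectors,
// so keeping only the top terms per person shrinks the matrix accordingly
val topics = ldaModel.topicsMatrix
println(s"${topics.numRows} terms x ${topics.numCols} topics")
```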
1 vote
0 answers

NaN in input vector for MLlib algorithms

I want to cluster my data using Spark's MLlib functions. The problem is that in my dataset I sometimes get NULL as a feature value. I can't write 0.0 instead, since that would just be wrong. So I tried using Double.NaN for the value. This doesn't…
antonpuz
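
MLlib's clustering does not treat NaN as missing, so such rows have to be handled before training. A minimal sketch of simply dropping them (imputation would be the heavier alternative), assuming vectors is an RDD[Vector]:

```scala
// vectors: RDD[org.apache.spark.mllib.linalg.Vector], assumed
// drop any point with a NaN feature before handing the data to the clusterer
val clean = vectors.filter(v => !v.toArray.exists(_.isNaN))
```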
1 vote
1 answer

Apache Spark TFIDF using Python

The Spark documentation says to use the HashingTF feature, but I'm unsure what the transform function expects as input. http://spark.apache.org/docs/latest/mllib-feature-extraction.html#tf-idf I tried running the tutorial code: from pyspark import…
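
transform expects one term sequence per document, i.e. an RDD of iterables. A Scala sketch mirroring the linked TF-IDF docs (the input file is hypothetical):

```scala
import org.apache.spark.mllib.feature.{HashingTF, IDF}
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.rdd.RDD

// each document is a Seq of terms; transform hashes them into a fixed-size vector
val documents: RDD[Seq[String]] = sc.textFile("docs.txt").map(_.split(" ").toSeq)
val tf: RDD[Vector] = new HashingTF().transform(documents)
tf.cache()  // the IDF pass reuses the term-frequency vectors
val tfidf: RDD[Vector] = new IDF().fit(tf).transform(tf)
```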
1 vote
0 answers

More information on clusters generated using the K-Means clustering algorithm in Spark MLlib

I'm seeking more information on the clusters generated using the K-Means clustering algorithm in Spark MLlib. At the end of the code snippet below, we have a K-Means model in the value clusters. val data = List((0.0, 0.0, 0.0),(0.1, 0.1,…
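
After training, the model exposes centroids, per-point assignments, and the clustering cost; a sketch on data shaped like the snippet in the question:

```scala
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

val data = sc.parallelize(Seq(
  Vectors.dense(0.0, 0.0, 0.0), Vectors.dense(0.1, 0.1, 0.1), Vectors.dense(9.0, 9.0, 9.0)))
val clusters = KMeans.train(data, k = 2, maxIterations = 20)

clusters.clusterCenters.foreach(println)   // centroid of each cluster
val assignments = clusters.predict(data)   // cluster index for every point
println(clusters.computeCost(data))        // within-set sum of squared errors
```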
1 vote
1 answer

Spark MLlib ALS trainImplicit values of more than 1

I have been experimenting with Spark MLlib ALS ("trainImplicit") for a while now and would like to understand: 1. Why am I getting rating values of more than 1 in the predictions? 2. Is there any need to normalize the user-product input? Sample…
KRG
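
With trainImplicit the model fits a 0/1 preference weighted by confidence, so predictions are relative scores rather than ratings and can fall outside [0, 1]. A hedged sketch, assuming model is the MatrixFactorizationModel trained above:

```scala
// rank products by score instead of reading the values as ratings
val top = model.recommendProducts(user = 101, num = 5)
top.foreach(r => println(s"product ${r.product} score ${r.rating}"))
```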