Questions tagged [apache-spark-mllib]

MLlib is a machine learning library for Apache Spark.

MLlib is a low-level, RDD-based machine learning library for Apache Spark.

2241 questions
1 vote
1 answer

How to convert a PythonRDD with sparse data into a dense PythonRDD

I want to use StandardScaler to scale the data. I've loaded the data into a PythonRDD. It seems the data is sparse. To apply StandardScaler, we should first convert it into dense types. trainData = MLUtils.loadLibSVMFile(sc, trainDataPath) valData…
mining
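
A minimal Scala sketch of the usual conversion (the path is hypothetical; withMean = true is what actually requires dense vectors):

```scala
import org.apache.spark.mllib.feature.StandardScaler
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.util.MLUtils

// loadLibSVMFile yields SparseVector features; rebuild each as a dense vector
val trainData = MLUtils.loadLibSVMFile(sc, "data/train.libsvm")  // hypothetical path
val dense = trainData.map(lp => LabeledPoint(lp.label, Vectors.dense(lp.features.toArray)))

// withMean = true only works on dense vectors, which is why the conversion is needed
val scaler = new StandardScaler(withMean = true, withStd = true).fit(dense.map(_.features))
val scaled = dense.map(lp => LabeledPoint(lp.label, scaler.transform(lp.features)))
```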
1 vote
1 answer

Best way to build a LabeledPoint of features for Apache Spark MLlib in Java

I am preparing data that contains IDs (labels) and keywords (features) to pass to MLlib algorithms, in Java. My keywords are comma-separated strings. My goal is to use multiclass classification algorithms to predict the ID. The question…
Sparkan
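
The question is Java, but here is a hedged Scala sketch of one common approach: hash the comma-separated keywords into a fixed-size feature vector with MLlib's HashingTF. The file name and line format are assumptions:

```scala
import org.apache.spark.mllib.feature.HashingTF
import org.apache.spark.mllib.regression.LabeledPoint

// assumed input format: "id,keyword1,keyword2,..." per line (hypothetical file)
val tf = new HashingTF(numFeatures = 1 << 18)
val points = sc.textFile("keywords.csv").map { line =>
  val fields = line.split(",")
  // the id becomes the label; the keywords are hashed into a sparse feature vector
  LabeledPoint(fields.head.toDouble, tf.transform(fields.tail.toSeq))
}
```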
1 vote
2 answers

Finding bigrams in Spark with Java 8

I have tokenized the sentences into a word RDD, so now I need bigrams, e.g. This is my test => (This is), (is my), (my test). I have searched around and found the .sliding operator for this purpose, but I'm not getting this option in my Eclipse (may it is…
insomniac
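
The sliding operator lives in MLlib's RDDFunctions and needs an explicit import, which is the usual reason IDEs don't offer it. A minimal Scala sketch (the source file is hypothetical):

```scala
import org.apache.spark.mllib.rdd.RDDFunctions._

// words: an RDD of tokens in document order
val words = sc.textFile("doc.txt").flatMap(_.split(" "))
// sliding(2) yields overlapping windows: (This, is), (is, my), (my, test)
val bigrams = words.sliding(2).map { case Array(a, b) => (a, b) }
```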
1 vote
1 answer

"The argument types of an anonymous function must be fully known (SLS 8.5)" when word2vec is applied to a DataFrame

I apply Spark's word2vec by using a dataframe. Here is my code: val df2 = df.groupBy("LABEL").agg(collect_list("TERM").alias("TERM")) val word2Vec = new Word2Vec() .setInputCol("TERM") .setOutputCol("result") …
mlee_jordan
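
A minimal sketch of the DataFrame-based Word2Vec from spark.ml, assuming df has string columns LABEL and TERM as in the question; the input column must be an array of strings, which is what collect_list produces:

```scala
import org.apache.spark.ml.feature.Word2Vec
import org.apache.spark.sql.functions.collect_list

// group terms into an array column, one token sequence per LABEL
val df2 = df.groupBy("LABEL").agg(collect_list("TERM").alias("TERM"))

val word2Vec = new Word2Vec()
  .setInputCol("TERM")     // must be an array-of-strings column
  .setOutputCol("result")
  .setVectorSize(100)
val result = word2Vec.fit(df2).transform(df2)
```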
1 vote
3 answers

Training Spark's word2vec with an RDD[String]

I'm new to Spark and Scala so I might have misunderstood some basic things here. I'm trying to train Spark's word2vec model on my own data. According to their documentation, one way to do this is val input = sc.textFile("text8").map(line =>…
burk
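
MLlib's Word2Vec.fit expects an RDD[Seq[String]] (one token sequence per line), not an RDD[String]. A minimal sketch following the documented text8 example (the query word is hypothetical):

```scala
import org.apache.spark.mllib.feature.Word2Vec

// split each line into tokens so the RDD element type is Seq[String], not String
val input = sc.textFile("text8").map(line => line.split(" ").toSeq)
val model = new Word2Vec().fit(input)
val synonyms = model.findSynonyms("day", 5)  // hypothetical query word
```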
1 vote
1 answer

/usr/bin/time CPU utilization versus top while using Spark

I ran an SVM algorithm using the MLlib library in Spark on data of size 8 GB with 7 million rows. I am running Spark in standalone mode on a single node. I used /usr/bin/time -v to capture data about the job. I got the peak memory utilization, and % CPU…
1 vote
0 answers

True negatives are 0% whereas true positives are 100% correctly classified

I used Naive Bayes from Spark's MLlib to train a model and test it on the data (in the form of an RDD). The results were confusing. The data and results are as follows: the problem is a binary classification one. The outcome should be either a label…
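
One way to see the imbalance is a confusion matrix; a hedged sketch, assuming predictionAndLabels is an RDD[(Double, Double)] of (predicted, actual) pairs built from the test set:

```scala
import org.apache.spark.mllib.evaluation.MulticlassMetrics

// predictionAndLabels: RDD[(Double, Double)], assumed built by running
// model.predict over the test features and zipping with the true labels
val metrics = new MulticlassMetrics(predictionAndLabels)
// a 0% true-negative rate shows up as an all-zero column for that class
println(metrics.confusionMatrix)
```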
1 vote
0 answers

Streaming Clustering with Unknown Number of Clusters

I need to classify a number of data points that will arrive over time. Streaming K-Means would be fine if I only knew how many different classes (clusters) I might find in my data points. Is there any way to use Spark MLlib 'out of the box' to run a…
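
For reference, MLlib's StreamingKMeans still requires fixing k up front, which is exactly the obstacle here; a minimal sketch (the cluster count, dimension, and input stream are assumptions):

```scala
import org.apache.spark.mllib.clustering.StreamingKMeans

val dim = 3  // hypothetical feature dimension
val model = new StreamingKMeans()
  .setK(5)                      // k must be chosen in advance, even when streaming
  .setDecayFactor(0.5)          // how quickly older points are forgotten
  .setRandomCenters(dim, 0.0)
// trainingStream: DStream[Vector], assumed built from a StreamingContext
model.trainOn(trainingStream)
```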
1 vote
0 answers

Elegant way to unfold RDD[LabeledPoint] into a DataFrame, something like RowMatrix

I have libsvm data with (label, sparse vector) pairs which I can load into an RDD[LabeledPoint]. I am wondering if there is an elegant way to convert it to a DataFrame that holds the unfolded features, something like RowMatrix, in Scala.
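
One possible unfolding, sketched under the assumption that data is an RDD[LabeledPoint] and a sqlContext is in scope: emit one Row per point with the label followed by one column per feature.

```scala
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{DoubleType, StructField, StructType}

// data: RDD[LabeledPoint], assumed loaded via MLUtils.loadLibSVMFile
val numFeatures = data.first().features.size
val rows = data.map(lp => Row.fromSeq(lp.label +: lp.features.toArray.toSeq))
val schema = StructType(
  StructField("label", DoubleType) +:
    (0 until numFeatures).map(i => StructField(s"f$i", DoubleType)))
val df = sqlContext.createDataFrame(rows, schema)  // one column per unfolded feature
```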
1 vote
0 answers

Spark mllib Collaborative Filtering, ValueError: RDD is empty

I'm new to Spark and am running the implicit collaborative filtering from MLlib here. When I run the following code on my data, I'm getting the following error: ValueError: RDD is empty. Here is my data: 101,1000010,1 101,1000011,1 …
jKraut
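
The question is pyspark, but the Scala flow is analogous; a hedged sketch with a guard, since this error usually means nothing survived parsing (the path and delimiter are assumptions):

```scala
import org.apache.spark.mllib.recommendation.{ALS, Rating}

// parse "user,product,rating" lines; a wrong delimiter or path leaves this RDD empty
val ratings = sc.textFile("ratings.csv").map(_.split(",")).collect {
  case Array(user, product, rating) => Rating(user.toInt, product.toInt, rating.toDouble)
}
require(!ratings.isEmpty(), "no ratings parsed - check the path and delimiter")
val model = ALS.trainImplicit(ratings, rank = 10, iterations = 10)
```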
1 vote
1 answer

Spark Latent Dirichlet Allocation model topic matrix is too small

First, just in case, I will explain how I represented the documents that I want to run the LDA model on. First, I do some preprocessing to get the most important terms per person for all their documents, then I get the union of all the most…
Jake Fund
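
topicsMatrix is vocabSize × k, so a "too small" matrix usually traces back to a shrunken vocabulary after preprocessing. A minimal sketch, assuming corpus is an RDD[(Long, Vector)] of (docId, termCounts):

```scala
import org.apache.spark.mllib.clustering.LDA

// corpus: RDD[(Long, Vector)], assumed prepared upstream
val ldaModel = new LDA().setK(10).run(corpus)
// the row count of topicsMatrix is fixed by the length of the term-count vectors,
// so keeping only the top terms per person shrinks the matrix accordingly
val topics = ldaModel.topicsMatrix
println(s"${topics.numRows} terms x ${topics.numCols} topics")
```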
1 vote
0 answers

NaN in input vector for MLlib algorithms

I want to cluster my data using Spark's MLlib functions. The problem is that in my dataset I sometimes get NULL as a feature value. I can't write 0.0 instead, since that would just be wrong. So I tried using Double.NaN for the value. This doesn't…
antonpuz
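
MLlib's clustering does not treat NaN as missing, so such rows have to be handled before training. A minimal sketch of simply dropping them (imputation would be the heavier alternative), assuming vectors is an RDD[Vector]:

```scala
// vectors: RDD[org.apache.spark.mllib.linalg.Vector], assumed
// drop any point with a NaN feature before handing the data to the clusterer
val clean = vectors.filter(v => !v.toArray.exists(_.isNaN))
```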
1 vote
1 answer

Apache Spark TFIDF using Python

The Spark documentation says to use the HashingTF feature, but I'm unsure what the transform function expects as input. http://spark.apache.org/docs/latest/mllib-feature-extraction.html#tf-idf I tried running the tutorial code: from pyspark import…
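
transform expects one term sequence per document, i.e. an RDD of iterables. A Scala sketch mirroring the linked TF-IDF docs (the input file is hypothetical):

```scala
import org.apache.spark.mllib.feature.{HashingTF, IDF}
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.rdd.RDD

// each document is a Seq of terms; transform hashes them into a fixed-size vector
val documents: RDD[Seq[String]] = sc.textFile("docs.txt").map(_.split(" ").toSeq)
val tf: RDD[Vector] = new HashingTF().transform(documents)
tf.cache()  // the IDF pass reuses the term-frequency vectors
val tfidf: RDD[Vector] = new IDF().fit(tf).transform(tf)
```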
1 vote
0 answers

More information on clusters generated using the K-Means clustering algorithm in Spark MLlib

I'm seeking more information on the clusters generated using the K-Means clustering algorithm in Spark MLlib. At the end of the code snippet below, we have a K-Means model in the value clusters. val data = List((0.0, 0.0, 0.0),(0.1, 0.1,…
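
After training, the model exposes centroids, per-point assignments, and the clustering cost; a sketch on data shaped like the snippet in the question:

```scala
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

val data = sc.parallelize(Seq(
  Vectors.dense(0.0, 0.0, 0.0), Vectors.dense(0.1, 0.1, 0.1), Vectors.dense(9.0, 9.0, 9.0)))
val clusters = KMeans.train(data, k = 2, maxIterations = 20)

clusters.clusterCenters.foreach(println)   // centroid of each cluster
val assignments = clusters.predict(data)   // cluster index for every point
println(clusters.computeCost(data))        // within-set sum of squared errors
```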
1 vote
1 answer

Spark MLlib ALS trainImplicit values of more than 1

I have been experimenting with Spark MLlib ALS ("trainImplicit") for a while now and would like to understand: 1. Why am I getting rating values of more than 1 in the predictions? 2. Is there any need to normalize the user-product input? Sample…
KRG
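
With trainImplicit the model fits a 0/1 preference weighted by confidence, so predictions are relative scores rather than ratings and can fall outside [0, 1]. A hedged sketch, assuming model is the MatrixFactorizationModel trained above:

```scala
// rank products by score instead of reading the values as ratings
val top = model.recommendProducts(user = 101, num = 5)
top.foreach(r => println(s"product ${r.product} score ${r.rating}"))
```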