Questions tagged [apache-spark-mllib]

MLlib is a machine learning library for Apache Spark.

MLlib is a low-level, RDD-based machine learning library for Apache Spark.


2241 questions
1 vote • 0 answers

Handling unseen categorical variables and MaxBins calculation in Spark Multiclass-classification

Below is the code I have for a RandomForest multiclass-classification model. I am reading from a CSV file and doing various transformations as seen in the code. I am calculating the max number of categories and then giving it as a parameter to RF.…
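A minimal Scala sketch of the setup the question describes, under a hypothetical CSV layout (label in column 0, the first two feature columns categorical and already encoded as 0-based codes): the arity of each categorical feature is counted from the data and maxBins is raised to cover the largest arity. Unseen categories at prediction time still need to be mapped to an existing code (or a reserved "other" code counted into the arity).

```scala
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.tree.RandomForest

val raw = sc.textFile("data.csv").map(_.split(","))   // hypothetical path; col 0 = label
val catFeatureIdx = Seq(0, 1)                         // hypothetical: first two features are categorical

// feature index -> number of distinct categories observed in the training data
val categoricalFeaturesInfo: Map[Int, Int] =
  catFeatureIdx.map(i => i -> raw.map(_(i + 1)).distinct().count().toInt).toMap

// assumes categorical values were already mapped to 0-based numeric codes upstream
val data = raw.map(r => LabeledPoint(r(0).toDouble, Vectors.dense(r.drop(1).map(_.toDouble))))

// maxBins must be at least as large as the arity of the biggest categorical feature
val maxBins = math.max(32, categoricalFeaturesInfo.values.max)
val model = RandomForest.trainClassifier(data, numClasses = 3,
  categoricalFeaturesInfo = categoricalFeaturesInfo, numTrees = 50,
  featureSubsetStrategy = "auto", impurity = "gini", maxDepth = 5, maxBins = maxBins)
```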
1 vote • 1 answer

spark - MLlib: transform and manage categorical features

For big datasets with 2bil+ samples and approximately 100+ features per sample: among these, 10% of the features are numerical/continuous variables and the rest are categorical variables (position, languages, url etc...). Let's use some…
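A rough Scala sketch of one way to manage this in the RDD API, assuming a hypothetical tab-separated layout with columns 0-9 numeric and the rest categorical: each categorical column gets a value-to-index dictionary that is broadcast to the executors. Collecting the dictionaries only stays cheap while per-column cardinality is modest; very high-cardinality columns would need hashing or spark.ml's StringIndexer instead.

```scala
import org.apache.spark.mllib.linalg.Vectors

val rows = sc.textFile("samples.tsv").map(_.split("\t"))   // hypothetical path and layout
val numericCols = 0 until 10                               // hypothetical numeric columns
val catCols = (10 until 100).toSeq                         // hypothetical categorical columns

// one value -> index dictionary per categorical column, broadcast to the executors
val dicts = catCols.map(c => c -> rows.map(_(c)).distinct().collect().zipWithIndex.toMap).toMap
val bDicts = sc.broadcast(dicts)

val features = rows.map { r =>
  val numeric = numericCols.map(i => r(i).toDouble)
  val encoded = catCols.map(c => bDicts.value(c)(r(c)).toDouble)
  Vectors.dense((numeric ++ encoded).toArray)
}
```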
1 vote • 1 answer

pyspark Linear Regression Example from official documentation - Bad results?

I am planning to use Linear Regression in Spark. To get started, I checked out the example from the official documentation (which you can find here). I also found this question on stackoverflow, which is essentially the same question as mine. The…
Kito • 1,375 • 4 • 17 • 37
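The question is about pyspark, but the relevant knobs are the same in either API; a hedged Scala sketch of what typically improves the poor fit in the SGD-based example: standardize the features, lower the step size, and raise the iteration count. The data path is the sample file shipped with the Spark distribution; the stepSize/iteration values are placeholders.

```scala
import org.apache.spark.mllib.feature.StandardScaler
import org.apache.spark.mllib.regression.{LabeledPoint, LinearRegressionWithSGD}
import org.apache.spark.mllib.util.MLUtils

val data = MLUtils.loadLibSVMFile(sc, "data/mllib/sample_linear_regression_data.txt")
val scaler = new StandardScaler(withMean = true, withStd = true).fit(data.map(_.features))
val scaled = data.map(p => LabeledPoint(p.label, scaler.transform(p.features))).cache()

// SGD is sensitive to stepSize; the default step often diverges on unscaled data
val model = LinearRegressionWithSGD.train(scaled, 1000 /* iterations */, 0.01 /* stepSize */)
val mse = scaled.map { p => val e = model.predict(p.features) - p.label; e * e }.mean()
```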
1 vote • 1 answer

How to use fixed-size bins in a histogram in Hive?

I'm using Spark MLlib k-Means, which requires features to have the same dimensions. The features are calculated using histograms, so I have to use fixed-size bins. Hive has a built-in function histogram_numeric(col, b) - Computes a histogram of a numeric…
wdz • 437 • 1 • 8 • 18
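If the histogram is computed on the Spark side instead, fixed-width bins with identical edges for every feature are straightforward; a small sketch with hypothetical bucket edges (Hive's histogram_numeric, by contrast, chooses its own bin centers per group).

```scala
val values = sc.textFile("measurements.txt").map(_.toDouble)   // hypothetical numeric column
val edges = (0 to 10).map(_ * 10.0).toArray                    // 10 fixed-width bins: [0,10), [10,20), ...
val counts: Array[Long] = values.histogram(edges)              // identical bins on every run
```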
1 vote • 1 answer

Load RDD of sparse vectors from text file

I am working in the Scala Spark Shell and have the following RDD: scala> docsWithFeatures res10: org.apache.spark.rdd.RDD[(Long, org.apache.spark.mllib.linalg.Vector)] = MapPartitionsRDD[162] at repartition at <console>:9 I previously saved this to…
moustachio • 2,924 • 3 • 36 • 68
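One hedged way to round-trip an RDD[(Long, Vector)] through text is LIBSVM format, stashing the Long id in the label slot (fine as long as ids are exactly representable as a Double); the output paths below are hypothetical.

```scala
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.util.MLUtils

// save: (id, vector) -> LabeledPoint(id, vector)
MLUtils.saveAsLibSVMFile(
  docsWithFeatures.map { case (id, v) => LabeledPoint(id.toDouble, v) },
  "hdfs:///tmp/docsWithFeatures")

// load: LabeledPoint -> (id, sparse vector)
val restored = MLUtils.loadLibSVMFile(sc, "hdfs:///tmp/docsWithFeatures")
  .map(p => (p.label.toLong, p.features))
```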
1 vote • 1 answer

Finding top words per kmeans cluster

I have the following section of code that maps the TFIDF for a collection of tweets onto the original words, which are then used to find the top words in each cluster: #document = sc.textFile("").map(lambda line: line.split(" ")) #"tfidf" is an…
adict11 • 15 • 7
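A Scala sketch of the general idea, assuming the TF-IDF vectors were built against an explicit vocabulary (an Array[String] whose position i matches vector dimension i) rather than HashingTF, which cannot be inverted exactly: take the heaviest dimensions of each cluster center and map them back to words.

```scala
import org.apache.spark.mllib.clustering.KMeansModel

def topWordsPerCluster(model: KMeansModel, vocab: Array[String], top: Int = 10): Seq[Seq[String]] =
  model.clusterCenters.toSeq.map { center =>
    center.toArray.zipWithIndex
      .sortBy { case (weight, _) => -weight }   // heaviest dimensions first
      .take(top)
      .map { case (_, idx) => vocab(idx) }      // dimension index -> word
      .toSeq
  }
```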
1 vote • 1 answer

Spark MLLib's LassoWithSGD doesn't scale?

I have code similar to what follows: val fileContent = sc.textFile("file:///myfile") val dataset = fileContent.map(row => { val explodedRow = row.split(",").map(s => s.toDouble) new LabeledPoint(explodedRow(13), Vectors.dense( …
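A hedged sketch of two changes that usually help when LassoWithSGD appears not to scale: cache the parsed dataset (otherwise every SGD iteration re-reads and re-parses the file) and standardize the features before training. The column layout mirrors the excerpt; iteration count, step size and regParam are placeholders.

```scala
import org.apache.spark.mllib.feature.StandardScaler
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.{LabeledPoint, LassoWithSGD}

val fileContent = sc.textFile("file:///myfile")
val parsed = fileContent.map { row =>
  val cols = row.split(",").map(_.toDouble)
  LabeledPoint(cols(13), Vectors.dense(cols.take(13)))
}
val scaler = new StandardScaler(withMean = true, withStd = true).fit(parsed.map(_.features))
val dataset = parsed.map(p => LabeledPoint(p.label, scaler.transform(p.features))).cache()

val model = LassoWithSGD.train(dataset, 100 /* iterations */, 1.0 /* stepSize */, 0.01 /* regParam */)
```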
1 vote • 2 answers

How to print the probability of prediction in LogisticRegressionWithLBFGS for pyspark

I am using Spark 1.5.1 and, in pyspark, after I fit the model using: model = LogisticRegressionWithLBFGS.train(parsedData) I can print the prediction using: model.predict(p.features) Is there a function to print the probability score also along…
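Both the Scala and pyspark LogisticRegressionModel expose a clearThreshold() method; once the threshold is cleared, predict() returns the class-1 probability instead of a hard 0/1 label. A Scala sketch, with the bundled sample LIBSVM file standing in for the question's parsedData:

```scala
import org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS
import org.apache.spark.mllib.util.MLUtils

val parsedData = MLUtils.loadLibSVMFile(sc, "data/mllib/sample_libsvm_data.txt")

val model = new LogisticRegressionWithLBFGS().setNumClasses(2).run(parsedData)
model.clearThreshold()                 // predict() now returns P(label = 1) rather than 0/1
val scored = parsedData.map(p => (p.label, model.predict(p.features)))
```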
1 vote • 1 answer

Cannot import name LDA MLlib in Spark

I'm trying to implement LDA using Spark and got this error. I'm totally new to Spark, so any help is appreciated. [root@sandbox ~]# spark-submit ./lda.py Traceback (most recent call last): File "/root/./lda.py", line 3, in from…
user1569341 • 333 • 1 • 6 • 17
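The pyspark wrapper for MLlib's LDA was only added around Spark 1.5, so an ImportError on an older sandbox is a likely cause. For reference, a minimal Scala sketch of the corresponding MLlib call with a tiny in-memory corpus:

```scala
import org.apache.spark.mllib.clustering.LDA
import org.apache.spark.mllib.linalg.Vectors

// tiny hypothetical corpus: each document is a vector of term counts, keyed by a doc id
val corpus = sc.parallelize(Seq(
    Vectors.dense(1.0, 2.0, 0.0),
    Vectors.dense(0.0, 3.0, 1.0)))
  .zipWithIndex.map(_.swap).cache()

val ldaModel = new LDA().setK(2).setMaxIterations(20).run(corpus)
val topics = ldaModel.topicsMatrix       // terms x topics weight matrix
```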
1 vote • 1 answer

Exception when trying to write a file to HDFS from Zeppelin

When trying to write to HDFS from Spark within Zeppelin, I am receiving this ClassNotFoundException for org.apache.hadoop.mapred.DirectFileOutputCommitter: java.lang.RuntimeException: java.lang.RuntimeException: java.lang.ClassNotFoundException:…
Greg • 557 • 4 • 20
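A speculative workaround sketch, assuming the committer class was pinned somewhere in the sandbox's Zeppelin/Spark configuration: point the old-API committer property back at the stock FileOutputCommitter (or, alternatively, add the jar that actually contains DirectFileOutputCommitter to the interpreter classpath). The output path is hypothetical.

```scala
sc.hadoopConfiguration.set(
  "mapred.output.committer.class",
  "org.apache.hadoop.mapred.FileOutputCommitter")   // fall back to the stock committer

sc.parallelize(1 to 10).saveAsTextFile("hdfs:///tmp/zeppelin-write-test")
```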
1 vote • 1 answer

Error thrown when using BlockMatrix.add

I'm attempting to use the distributed matrix data structure BlockMatrix (Spark 1.5.0, Scala) and am having some issues when trying to add two block matrices together (error attached below). I'm constructing the two matrices by creating a collection…
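For add to work, both BlockMatrix operands must share the same overall dimensions and the same rowsPerBlock/colsPerBlock partitioning; a small sketch that satisfies those constraints (values are arbitrary):

```scala
import org.apache.spark.mllib.linalg.distributed.{CoordinateMatrix, MatrixEntry}

val entriesA = sc.parallelize(Seq(MatrixEntry(0, 0, 1.0), MatrixEntry(5, 5, 2.0)))
val entriesB = sc.parallelize(Seq(MatrixEntry(0, 0, 3.0), MatrixEntry(5, 5, 4.0)))

// same 6x6 dimensions and the same 2x2 block size on both sides
val a = new CoordinateMatrix(entriesA, 6, 6).toBlockMatrix(2, 2).cache()
val b = new CoordinateMatrix(entriesB, 6, 6).toBlockMatrix(2, 2).cache()

a.validate()          // optional sanity check of the block layout
val sum = a.add(b)
```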
1 vote • 1 answer

Spark MLlib Frequent Pattern Mining, type parameter bounds

I have data in key,value pairs where the key is the column index and the value is whatever is in that column. My original file is just a csv. So I have the following: val myData = sc.textFile(file1) .map(x => x.split('|')) .flatMap(x =>…
theMadKing • 2,064 • 7 • 32 • 59
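One common cause of type-parameter-bounds errors with FPGrowth.run is passing something other than an RDD of arrays of a concrete item type. A hedged sketch with a hypothetical pipe-delimited input (FPGrowth also requires the items within one transaction to be unique, hence the distinct):

```scala
import org.apache.spark.mllib.fpm.FPGrowth
import org.apache.spark.rdd.RDD

val myData = sc.textFile("transactions.psv")                      // hypothetical path
val transactions: RDD[Array[String]] = myData.map(_.split('|').distinct)

val model = new FPGrowth().setMinSupport(0.2).setNumPartitions(10).run(transactions)
model.freqItemsets.collect().foreach { itemset =>
  println(itemset.items.mkString("[", ",", "]") + ": " + itemset.freq)
}
```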
1 vote • 1 answer

Logistic regression training data set true/false ratio

I am working on a classifier, using logistic regression, based on Spark ML, and I wonder whether I should train with equal quantities of data for true and false. I mean, when I want to classify people as male or female, is it OK to train a model with 100 male…
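MLlib's RDD-based LogisticRegressionWithLBFGS has no per-class weighting option, so if the imbalance turns out to hurt, one hedged option is to downsample the majority class before training; the sampling fractions below are placeholders.

```scala
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.rdd.RDD

// downsample class 0.0, keeping class 1.0 intact; the 0.1 fraction is a placeholder
def downsampleMajority(trainingData: RDD[LabeledPoint]): RDD[LabeledPoint] = {
  val fractions = Map(0.0 -> 0.1, 1.0 -> 1.0)
  trainingData.map(p => (p.label, p))
    .sampleByKey(withReplacement = false, fractions = fractions)
    .values
}
```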
1 vote • 2 answers

Why is my Spark SVM always predicting the same label?

I'm having trouble getting my SVM to predict 0's and 1's where I would expect it to. It seems that after I train it and give it more data, it always wants to predict a 1 or a 0, but it will predict all 1's or all 0's, and never a mix of the two. …
Nathaniel • 540 • 1 • 7 • 17
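A hedged diagnostic sketch: clearing the threshold makes predict() return the raw margin, which shows whether the model has genuinely collapsed to one side; standardizing the features and tuning regParam/numIterations are the usual remedies. The bundled sample LIBSVM file stands in for the question's data.

```scala
import org.apache.spark.mllib.classification.SVMWithSGD
import org.apache.spark.mllib.feature.StandardScaler
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.util.MLUtils

val training = MLUtils.loadLibSVMFile(sc, "data/mllib/sample_libsvm_data.txt")
val scaler = new StandardScaler(withMean = true, withStd = true).fit(training.map(_.features))
val scaled = training.map(p => LabeledPoint(p.label, scaler.transform(p.features))).cache()

val model = SVMWithSGD.train(scaled, 200 /* iterations */)
model.clearThreshold()                                   // predict() now returns the raw margin
val margins = scaled.map(p => (p.label, model.predict(p.features)))
```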
1 vote • 1 answer

Error Loading mllib sample data into PySpark

Trying to load some of the sample data into PySpark for Spark 1.3.0's MLlib example for RandomForests and am getting the errors below. I am new to MLlib and am uncertain how to examine this error further. Code:…
unique_beast • 1,379 • 2 • 11 • 23
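This kind of load error is often just a path problem: the sample file ships inside the Spark distribution, so it has to be loaded relative to SPARK_HOME (or copied to HDFS). A Scala sketch of the equivalent loader call; pyspark's MLUtils.loadLibSVMFile behaves the same way.

```scala
import org.apache.spark.mllib.util.MLUtils

val sparkHome = sys.env.getOrElse("SPARK_HOME", ".")
val data = MLUtils.loadLibSVMFile(sc, s"$sparkHome/data/mllib/sample_libsvm_data.txt")
data.take(1).foreach(println)
```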