Questions tagged [apache-spark-mllib]

MLlib is a machine learning library for Apache Spark.

MLlib is a low-level, RDD-based machine learning library for Apache Spark.


2241 questions
1 vote • 0 answers

Handling unseen categorical variables and MaxBins calculation in Spark Multiclass-classification

Below is the code I have for a RandomForest multiclass-classification model. I am reading from a CSV file and doing various transformations as seen in the code. I am calculating the max number of categories and then giving it as a parameter to RF.…
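A minimal Scala sketch of the setup the question describes, under a hypothetical CSV layout (label in column 0, the first two feature columns categorical and already encoded as 0-based codes): the arity of each categorical feature is counted from the data and maxBins is raised to cover the largest arity. Unseen categories at prediction time still need to be mapped to an existing code (or a reserved "other" code counted into the arity).

```scala
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.tree.RandomForest

val raw = sc.textFile("data.csv").map(_.split(","))   // hypothetical path; col 0 = label
val catFeatureIdx = Seq(0, 1)                         // hypothetical: first two features are categorical

// feature index -> number of distinct categories observed in the training data
val categoricalFeaturesInfo: Map[Int, Int] =
  catFeatureIdx.map(i => i -> raw.map(_(i + 1)).distinct().count().toInt).toMap

// assumes categorical values were already mapped to 0-based numeric codes upstream
val data = raw.map(r => LabeledPoint(r(0).toDouble, Vectors.dense(r.drop(1).map(_.toDouble))))

// maxBins must be at least as large as the arity of the biggest categorical feature
val maxBins = math.max(32, categoricalFeaturesInfo.values.max)
val model = RandomForest.trainClassifier(data, numClasses = 3,
  categoricalFeaturesInfo = categoricalFeaturesInfo, numTrees = 50,
  featureSubsetStrategy = "auto", impurity = "gini", maxDepth = 5, maxBins = maxBins)
```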
1 vote • 1 answer

spark - MLlib: transform and manage categorical features

For big datasets with 2bil+ samples and approximately 100+ features per sample: among these, 10% of the features are numerical/continuous variables and the rest are categorical variables (position, languages, url etc...). Let's use some…
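A rough Scala sketch of one way to manage this in the RDD API, assuming a hypothetical tab-separated layout with columns 0-9 numeric and the rest categorical: each categorical column gets a value-to-index dictionary that is broadcast to the executors. Collecting the dictionaries only stays cheap while per-column cardinality is modest; very high-cardinality columns would need hashing or spark.ml's StringIndexer instead.

```scala
import org.apache.spark.mllib.linalg.Vectors

val rows = sc.textFile("samples.tsv").map(_.split("\t"))   // hypothetical path and layout
val numericCols = 0 until 10                               // hypothetical numeric columns
val catCols = (10 until 100).toSeq                         // hypothetical categorical columns

// one value -> index dictionary per categorical column, broadcast to the executors
val dicts = catCols.map(c => c -> rows.map(_(c)).distinct().collect().zipWithIndex.toMap).toMap
val bDicts = sc.broadcast(dicts)

val features = rows.map { r =>
  val numeric = numericCols.map(i => r(i).toDouble)
  val encoded = catCols.map(c => bDicts.value(c)(r(c)).toDouble)
  Vectors.dense((numeric ++ encoded).toArray)
}
```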
1 vote • 1 answer

pyspark Linear Regression Example from official documentation - Bad results?

I am planning to use Linear Regression in Spark. To get started, I checked out the example from the official documentation (which you can find here). I also found this question on stackoverflow, which is essentially the same question as mine. The…
Kito • 1,375 • 4 • 17 • 37
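The question is about pyspark, but the relevant knobs are the same in either API; a hedged Scala sketch of what typically improves the poor fit in the SGD-based example: standardize the features, lower the step size, and raise the iteration count. The data path is the sample file shipped with the Spark distribution; the stepSize/iteration values are placeholders.

```scala
import org.apache.spark.mllib.feature.StandardScaler
import org.apache.spark.mllib.regression.{LabeledPoint, LinearRegressionWithSGD}
import org.apache.spark.mllib.util.MLUtils

val data = MLUtils.loadLibSVMFile(sc, "data/mllib/sample_linear_regression_data.txt")
val scaler = new StandardScaler(withMean = true, withStd = true).fit(data.map(_.features))
val scaled = data.map(p => LabeledPoint(p.label, scaler.transform(p.features))).cache()

// SGD is sensitive to stepSize; the default step often diverges on unscaled data
val model = LinearRegressionWithSGD.train(scaled, 1000 /* iterations */, 0.01 /* stepSize */)
val mse = scaled.map { p => val e = model.predict(p.features) - p.label; e * e }.mean()
```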
1 vote • 1 answer

How to use fixed-size bins in a histogram in Hive?

I'm using Spark MLlib k-Means, which requires features to have the same dimensions. The features are calculated using histograms, so I have to use fixed-size bins. Hive has a built-in function histogram_numeric(col, b) - Computes a histogram of a numeric…
wdz • 437 • 1 • 8 • 18
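If the histogram is computed on the Spark side instead, fixed-width bins with identical edges for every feature are straightforward; a small sketch with hypothetical bucket edges (Hive's histogram_numeric, by contrast, chooses its own bin centers per group).

```scala
val values = sc.textFile("measurements.txt").map(_.toDouble)   // hypothetical numeric column
val edges = (0 to 10).map(_ * 10.0).toArray                    // 10 fixed-width bins: [0,10), [10,20), ...
val counts: Array[Long] = values.histogram(edges)              // identical bins on every run
```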
1 vote • 1 answer

Load RDD of sparse vectors from text file

I am working in the Scala Spark Shell and have the following RDD: scala> docsWithFeatures res10: org.apache.spark.rdd.RDD[(Long, org.apache.spark.mllib.linalg.Vector)] = MapPartitionsRDD[162] at repartition at <console>:9 I previously saved this to…
moustachio • 2,924 • 3 • 36 • 68
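One hedged way to round-trip an RDD[(Long, Vector)] through text is LIBSVM format, stashing the Long id in the label slot (fine as long as ids are exactly representable as a Double); the output paths below are hypothetical.

```scala
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.util.MLUtils

// save: (id, vector) -> LabeledPoint(id, vector)
MLUtils.saveAsLibSVMFile(
  docsWithFeatures.map { case (id, v) => LabeledPoint(id.toDouble, v) },
  "hdfs:///tmp/docsWithFeatures")

// load: LabeledPoint -> (id, sparse vector)
val restored = MLUtils.loadLibSVMFile(sc, "hdfs:///tmp/docsWithFeatures")
  .map(p => (p.label.toLong, p.features))
```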
1 vote • 1 answer

Finding top words per kmeans cluster

I have the following section of code that maps the TFIDF for a collection of tweets onto the original words, which are then used to find the top words in each cluster: #document = sc.textFile("").map(lambda line: line.split(" ")) #"tfidf" is an…
adict11 • 15 • 7
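A Scala sketch of the general idea, assuming the TF-IDF vectors were built against an explicit vocabulary (an Array[String] whose position i matches vector dimension i) rather than HashingTF, which cannot be inverted exactly: take the heaviest dimensions of each cluster center and map them back to words.

```scala
import org.apache.spark.mllib.clustering.KMeansModel

def topWordsPerCluster(model: KMeansModel, vocab: Array[String], top: Int = 10): Seq[Seq[String]] =
  model.clusterCenters.toSeq.map { center =>
    center.toArray.zipWithIndex
      .sortBy { case (weight, _) => -weight }   // heaviest dimensions first
      .take(top)
      .map { case (_, idx) => vocab(idx) }      // dimension index -> word
      .toSeq
  }
```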
1 vote • 1 answer

Spark MLLib's LassoWithSGD doesn't scale?

I have code similar to what follows: val fileContent = sc.textFile("file:///myfile") val dataset = fileContent.map(row => { val explodedRow = row.split(",").map(s => s.toDouble) new LabeledPoint(explodedRow(13), Vectors.dense( …
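A hedged sketch of two changes that usually help when LassoWithSGD appears not to scale: cache the parsed dataset (otherwise every SGD iteration re-reads and re-parses the file) and standardize the features before training. The column layout mirrors the excerpt; iteration count, step size and regParam are placeholders.

```scala
import org.apache.spark.mllib.feature.StandardScaler
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.{LabeledPoint, LassoWithSGD}

val fileContent = sc.textFile("file:///myfile")
val parsed = fileContent.map { row =>
  val cols = row.split(",").map(_.toDouble)
  LabeledPoint(cols(13), Vectors.dense(cols.take(13)))
}
val scaler = new StandardScaler(withMean = true, withStd = true).fit(parsed.map(_.features))
val dataset = parsed.map(p => LabeledPoint(p.label, scaler.transform(p.features))).cache()

val model = LassoWithSGD.train(dataset, 100 /* iterations */, 1.0 /* stepSize */, 0.01 /* regParam */)
```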
1 vote • 2 answers

How to print the probability of prediction in LogisticRegressionWithLBFGS for pyspark

I am using Spark 1.5.1 and, in pyspark, after I fit the model using: model = LogisticRegressionWithLBFGS.train(parsedData) I can print the prediction using: model.predict(p.features) Is there a function to print the probability score also along…
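Both the Scala and pyspark LogisticRegressionModel expose a clearThreshold() method; once the threshold is cleared, predict() returns the class-1 probability instead of a hard 0/1 label. A Scala sketch, with the bundled sample LIBSVM file standing in for the question's parsedData:

```scala
import org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS
import org.apache.spark.mllib.util.MLUtils

val parsedData = MLUtils.loadLibSVMFile(sc, "data/mllib/sample_libsvm_data.txt")

val model = new LogisticRegressionWithLBFGS().setNumClasses(2).run(parsedData)
model.clearThreshold()                 // predict() now returns P(label = 1) rather than 0/1
val scored = parsedData.map(p => (p.label, model.predict(p.features)))
```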
1 vote • 1 answer

Cannot import name LDA MLlib in Spark

I'm trying to implement LDA using Spark and got this error. I'm totally new to Spark, so any help is appreciated. [root@sandbox ~]# spark-submit ./lda.py Traceback (most recent call last): File "/root/./lda.py", line 3, in from…
user1569341 • 333 • 1 • 6 • 17
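The pyspark wrapper for MLlib's LDA was only added around Spark 1.5, so an ImportError on an older sandbox is a likely cause. For reference, a minimal Scala sketch of the corresponding MLlib call with a tiny in-memory corpus:

```scala
import org.apache.spark.mllib.clustering.LDA
import org.apache.spark.mllib.linalg.Vectors

// tiny hypothetical corpus: each document is a vector of term counts, keyed by a doc id
val corpus = sc.parallelize(Seq(
    Vectors.dense(1.0, 2.0, 0.0),
    Vectors.dense(0.0, 3.0, 1.0)))
  .zipWithIndex.map(_.swap).cache()

val ldaModel = new LDA().setK(2).setMaxIterations(20).run(corpus)
val topics = ldaModel.topicsMatrix       // terms x topics weight matrix
```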
1 vote • 1 answer

Exception when trying to write a file to HDFS from Zeppelin

When trying to write to HDFS from Spark within Zeppelin, I am receiving this ClassNotFoundException for org.apache.hadoop.mapred.DirectFileOutputCommitter: java.lang.RuntimeException: java.lang.RuntimeException: java.lang.ClassNotFoundException:…
Greg • 557 • 4 • 20
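A speculative workaround sketch, assuming the committer class was pinned somewhere in the sandbox's Zeppelin/Spark configuration: point the old-API committer property back at the stock FileOutputCommitter (or, alternatively, add the jar that actually contains DirectFileOutputCommitter to the interpreter classpath). The output path is hypothetical.

```scala
sc.hadoopConfiguration.set(
  "mapred.output.committer.class",
  "org.apache.hadoop.mapred.FileOutputCommitter")   // fall back to the stock committer

sc.parallelize(1 to 10).saveAsTextFile("hdfs:///tmp/zeppelin-write-test")
```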
1 vote • 1 answer

Error thrown when using BlockMatrix.add

I'm attempting to use the distributed matrix data structure BlockMatrix (Spark 1.5.0, Scala) and am having some issues when trying to add two block matrices together (error attached below). I'm constructing the two matrices by creating a collection…
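For add to work, both BlockMatrix operands must share the same overall dimensions and the same rowsPerBlock/colsPerBlock partitioning; a small sketch that satisfies those constraints (values are arbitrary):

```scala
import org.apache.spark.mllib.linalg.distributed.{CoordinateMatrix, MatrixEntry}

val entriesA = sc.parallelize(Seq(MatrixEntry(0, 0, 1.0), MatrixEntry(5, 5, 2.0)))
val entriesB = sc.parallelize(Seq(MatrixEntry(0, 0, 3.0), MatrixEntry(5, 5, 4.0)))

// same 6x6 dimensions and the same 2x2 block size on both sides
val a = new CoordinateMatrix(entriesA, 6, 6).toBlockMatrix(2, 2).cache()
val b = new CoordinateMatrix(entriesB, 6, 6).toBlockMatrix(2, 2).cache()

a.validate()          // optional sanity check of the block layout
val sum = a.add(b)
```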
1 vote • 1 answer

Spark MLlib Frequent Pattern Mining, type parameter bounds

I have data in key,value pairs where the key is the column index and the value is whatever is in that column. My original file is just a csv. So I have the following: val myData = sc.textFile(file1) .map(x => x.split('|')) .flatMap(x =>…
theMadKing • 2,064 • 7 • 32 • 59
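One common cause of type-parameter-bounds errors with FPGrowth.run is passing something other than an RDD of arrays of a concrete item type. A hedged sketch with a hypothetical pipe-delimited input (FPGrowth also requires the items within one transaction to be unique, hence the distinct):

```scala
import org.apache.spark.mllib.fpm.FPGrowth
import org.apache.spark.rdd.RDD

val myData = sc.textFile("transactions.psv")                      // hypothetical path
val transactions: RDD[Array[String]] = myData.map(_.split('|').distinct)

val model = new FPGrowth().setMinSupport(0.2).setNumPartitions(10).run(transactions)
model.freqItemsets.collect().foreach { itemset =>
  println(itemset.items.mkString("[", ",", "]") + ": " + itemset.freq)
}
```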
1 vote • 1 answer

Logistic regression training data set true/false ratio

I am working on a classifier, using logistic regression, based on Spark ML, and I wonder whether I should train with equal quantities of data for true and false. I mean, when I want to classify people as male or female, is it OK to train a model with 100 male…
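MLlib's RDD-based LogisticRegressionWithLBFGS has no per-class weighting option, so if the imbalance turns out to hurt, one hedged option is to downsample the majority class before training; the sampling fractions below are placeholders.

```scala
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.rdd.RDD

// downsample class 0.0, keeping class 1.0 intact; the 0.1 fraction is a placeholder
def downsampleMajority(trainingData: RDD[LabeledPoint]): RDD[LabeledPoint] = {
  val fractions = Map(0.0 -> 0.1, 1.0 -> 1.0)
  trainingData.map(p => (p.label, p))
    .sampleByKey(withReplacement = false, fractions = fractions)
    .values
}
```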
1 vote • 2 answers

Why is my Spark SVM always predicting the same label?

I'm having trouble getting my SVM to predict 0's and 1's where I would expect it to. It seems that after I train it and give it more data, it always wants to predict a 1 or a 0, but it will predict all 1's or all 0's, and never a mix of the two. …
Nathaniel • 540 • 1 • 7 • 17
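A hedged diagnostic sketch: clearing the threshold makes predict() return the raw margin, which shows whether the model has genuinely collapsed to one side; standardizing the features and tuning regParam/numIterations are the usual remedies. The bundled sample LIBSVM file stands in for the question's data.

```scala
import org.apache.spark.mllib.classification.SVMWithSGD
import org.apache.spark.mllib.feature.StandardScaler
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.util.MLUtils

val training = MLUtils.loadLibSVMFile(sc, "data/mllib/sample_libsvm_data.txt")
val scaler = new StandardScaler(withMean = true, withStd = true).fit(training.map(_.features))
val scaled = training.map(p => LabeledPoint(p.label, scaler.transform(p.features))).cache()

val model = SVMWithSGD.train(scaled, 200 /* iterations */)
model.clearThreshold()                                   // predict() now returns the raw margin
val margins = scaled.map(p => (p.label, model.predict(p.features)))
```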
1 vote • 1 answer

Error Loading mllib sample data into PySpark

Trying to load some of the sample data into PySpark for Spark 1.3.0's MLlib example for RandomForests and am getting the errors below. I am new to MLlib and am uncertain how to examine this error further. Code:…
unique_beast • 1,379 • 2 • 11 • 23
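This kind of load error is often just a path problem: the sample file ships inside the Spark distribution, so it has to be loaded relative to SPARK_HOME (or copied to HDFS). A Scala sketch of the equivalent loader call; pyspark's MLUtils.loadLibSVMFile behaves the same way.

```scala
import org.apache.spark.mllib.util.MLUtils

val sparkHome = sys.env.getOrElse("SPARK_HOME", ".")
val data = MLUtils.loadLibSVMFile(sc, s"$sparkHome/data/mllib/sample_libsvm_data.txt")
data.take(1).foreach(println)
```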